Monday, June 29, 2015

Fun with Vowpal Wabbit, NLTK and Memorable Movie Quotes


I was at NAACL 2015 earlier (first week of) this month. This was my very first time attending an academic conference, and I found it incredibly interesting. New ideas were unveiled and old benchmarks crushed on a regular basis over the course of the week. Some of the ideas were familiar, others completely new. It was a chance to listen in person to people whose blog posts and textbooks I have read in the past. It was a chance to talk to incredibly smart people who are doing extremely cool things with text (as well as speech and video). I came away with several new ideas, techniques and technologies that I plan to explore in the coming months. As an ex-Denverite, it was also a chance to meet and catch up with old friends.

As part of the conference, I attended a tutorial on Learning to Search (L2S) for Structured Prediction, where there was a tutorial on Vowpal Wabbit (VW) basics (since VW is used in the implementation). I had read many good things about VW on the Kaggle message boards, but hadn't used it so far, so for me, this was also a chance to get familiar with VW.

The keynote for the conference was delivered by Lillian Lee who talked about investigating the effect of language on people. Among the examples she provided, she mentioned her work on memorable movie quotes (in itself an excellent primer on experiment design BTW). The gist of this work is finding memorable movie quotes and finding similar length utterances by the same character at around the same point in the movie that are not memorable, and using them as positive and negative examples to train a classifier that predicts if an utterance is memorable. More details can be found in the paper You had me at hello: How phrasing affects memorability.

The dataset seemed fairly well-defined and simple (deceptively simple, as I would find out later), so I decided to try to build a classifier with VW (and get familiar with VW in the process). This post describes that effort.

VW provides a rich command-line interface (CLI). Recently it also provides a Python based programmatic API but there were some problems getting it to work on Mac OSX (my current platform) during the tutorial. The Python API can be useful when you are using VW as a platform rather than a tool (for example when doing L2S), but I just wanted to do simple classification, so I just focused on the CLI.

The dataset of memorable movie quotes can be found here. It consists of 2,197 memorable movie quotes paired with 2,197 non-memorable ones. Each pair of quotes is organized into 4 lines as shown below (in the file moviequotes.memorable_nonmemorable_pairs.txt):

1
2
3
4
5
wild things
My mother would kill me if she knew I took the Rover!
875814 My mom would kill me if she knew I took the Rover.
875877 I think it was something else... Suzie says the bust was bullshit...
...

The first line is the name of the movie, the second line is the memorable quote, the third line is the memorable quote as it is in the script (not necessarily exactly the same as the quote that became memorable), and the fourth line is a non-memorable utterance by the same character at around the same time in the movie. We first transform this to a format that is easier to work with downstream.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
# Source: src/moviequotes/preprocess_data.py
# -*- coding: utf-8 -*-

def main():
    fin = open("../../data/moviequotes/moviequotes.memorable_nonmemorable_pairs.txt", 'rb')
    fout = open("../../data/moviequotes/moviequotes.txt", 'wb')
    line_num = 0
    mem_line = None
    nonmem_line = None
    for line in fin:
        line = line.strip()
        if line_num == 1:
            mem_line = line
        if line_num == 3:
            nonmem_line = line[line.index(" ")+1:]
        if len(line) == 0:
            fout.write("%d\t%s\n" % (1, mem_line))
            fout.write("%d\t%s\n" % (0, nonmem_line))
            line_num = 0
            mem_line = None
            nonmem_line = None
        else:
            line_num += 1
    fin.close()
    fout.flush()
    fout.close()
    
if __name__ == "__main__":
    """ Convert Cornell data into easy to read tabular format """
    main()

This transforms the data into the following flat format.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
1 My mother would kill me if she knew I took the Rover!
0 I think it was something else... Suzie says the bust was bullshit...
1 You've been a bad girl, Grace!
0 What the hell you doin', Grace?
1 And a waitress named Flo.
0 You son of a bitch!
1 I bet right now, you don't know if you want to kill me... or fuck me.
0 Women say they don't want to be taken like, really taken -- that's bullshit -- they do.
1 Is everyone in this town on drugs?
0 Well, he offered, and I just couldn't refuse.
...

The next part is to split this up into a training and test set into the format VW wants. We use 70% of the data as our training set and the remaining 30% as the validation set. Our first attempt is to just treat all the text as features. So our data looks like this:

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
1 | My mother would kill me if she knew I took the Rover!
-1 | I think it was something else... Suzie says the bust was bullshit...
1 | You've been a bad girl, Grace!
-1 | What the hell you doin', Grace?
1 | And a waitress named Flo.
-1 | Well, he offered, and I just couldn't refuse.
1 | And here I made you all hot and sweaty...
-1 | And I don't know if you're really giving.
1 | Look, defenseless babies!
-1 | I hate this.
...

We train a VW model (model.vw) using Logistic Regression. VW offers other models such as SVM, but this was the one that gave me the best results.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
$ vw -d train_moviequotes.txt -f model.vw \
     --loss_function logistic --passes 10000 --cache_file cache
final_regressor = model.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = cache
ignoring text input in favor of cache input
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.693147 0.693147            1            1.0   1.0000   0.0000       13
0.768432 0.843716            2            2.0  -1.0000   0.2814       13
0.707602 0.646772            4            4.0   1.0000   0.0773        7
0.709794 0.711985            8            8.0   1.0000   0.1979        6
0.727054 0.744315           16           16.0  -1.0000  -0.0520        5
0.711858 0.696662           32           32.0  -1.0000  -0.0961       16
0.705705 0.699551           64           64.0  -1.0000  -0.2107        6
0.727612 0.749519          128          128.0  -1.0000   0.3081       17
0.705219 0.682826          256          256.0   1.0000  -0.2075        9
0.689915 0.674610          512          512.0   1.0000   0.3932        7
0.682021 0.674128         1024         1024.0  -1.0000  -0.4391       14
0.675095 0.668168         2048         2048.0  -1.0000  -0.4202        7
0.668805 0.668805         4096         4096.0  -1.0000   0.0315        4 h
0.664234 0.659672         8192         8192.0   1.0000  -0.1119        7 h
0.657264 0.650294        16384        16384.0  -1.0000  -2.8443       12 h

finished run
number of examples per pass = 2784
passes used = 7
weighted example sum = 19488.000000
weighted label sum = -126.000000
average loss = 0.651617 h
best constant = -0.012931
best constant's loss = 0.693126
total feature number = 207620

$ vw test_moviequotes.txt -t -i model.vw -p test_pred.txt
only testing
predictions = test_pred.txt
Num weight bits = 18
learning rate = 10
initial_t = 1
power_t = 0.5
using no cache
Reading datafile = test_moviequotes.txt
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.175001 0.175001            1            1.0  -1.0000  -0.5817        9
11.97682 23.77864            2            2.0  -1.0000  -5.8763       18
6.771072 1.565320            4            4.0   1.0000  -0.4749       13
5.300591 3.830111            8            8.0   1.0000  -1.7914       28
3.082561 0.864532           16           16.0   1.0000   0.8826       12
2.398779 1.714996           32           32.0  -1.0000  -1.4483       18
1.842740 1.286701           64           64.0  -1.0000   0.6553        7
1.602232 1.361725          128          128.0  -1.0000  -0.0691       28
1.632950 1.663667          256          256.0   1.0000   0.6150        5
1.548076 1.463202          512          512.0   1.0000  -0.3059        9
1.552964 1.557851         1024         1024.0  -1.0000   1.4476       11

finished run
number of examples per pass = 1301
passes used = 1
weighted example sum = 1301.000000
weighted label sum = 11.000000
average loss = 1.522137
best constant = 0.008455
best constant's loss = 0.999928
total feature number = 14037

The second call creates a train_pred.txt file containing the predictions that look something like this:

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
1.773244
-2.195071
1.417488
1.006819
-0.361311
0.838351
-1.418754
0.207618
0.272800
1.483417
...

To calculate the accuracy, we compare the values predicted (passing them through a step function so any positive prediction is a 1 and any negative prediction is a -1) with the actual values in our test_moviequotes.txt file.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
# Source: src/moviequotes/accuracy_calc.py
# -*- coding: utf-8 -*-
from __future__ import division

DATA_FILE = "../../data/moviequotes/%s_moviequotes.txt" 
PRED_FILE = "../../data/moviequotes/%s_pred.txt"

def step(x):
    return -1 if x < 0 else 1
    
def main():
    for m in ["train", "test"]:
        fdata = open(DATA_FILE % m, 'rb')
        label_acts = []
        for line in fdata:
            label_acts.append(int(line.split("|")[0].strip()))
        fdata.close()
        label_preds = []
        fpred = open(PRED_FILE % m, 'rb') 
        for line in fpred:
            label_preds.append(step(float(line.strip())))
        fpred.close()
        num_recs = 0
        num_correct = 0
        for act, pred in zip(label_acts, label_preds):
            if act == pred:
                num_correct += 1
            num_recs += 1
        print("Accuracy for %s: %.3f %%" % (m, 100.0 * num_correct / num_recs))
            
if __name__ == "__main__":
    main()

This gives us a test set accuracy of 58.647%, slightly lower than the result the authors of the paper had with their bag of words model (59.67%). The paper goes on to describe some additional features that they bring their accuracy numbers up to 64.46%, so I tried to replicate similar features for my model to see if I could also get these higher numbers.

The three things the authors noticed about memorable quotes were (a) the use of language in memorable quotes was to some extent unusual (Distinctiveness), (b) they were easier to use outside the specific context in which they were originally used (Generality) and (c) the use of specific language that is "memorable". The first is modeled as the likelihood of a quote against (unigram, bigram and trigram) language models built using the Brown Corpus News sentences. The second is modeled as the use of fewer personal pronouns, more use of indefinite articles and past tense verbs in memorable than in non-memorable quotes. The third was done by constructing separate language models out of the memorable and non-memorable quotes and comparing the likelihood against them.

I implemented the first two extensions in the distinctiveness() and generality() functions in the code below. The code splits the data into a 70/30 split and computes the extra features and writes them out into the data file along with the text.

1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
# Source: src/moviequotes/vwformat_split.py
# -*- coding: utf-8 -*-
from __future__ import division
from nltk.corpus import brown
import math
import nltk
import random

INPUT_FILE = "../../data/moviequotes/moviequotes.txt"
VW_TRAIN_FILE = "../../data/moviequotes/train_moviequotes.txt"
VW_TEST_FILE = "../../data/moviequotes/test_moviequotes.txt"
TRAIN_TEST_SPLIT = 0.7

def generality(s):
    """
    Counts the number of personal pronouns (PRP), Indefinite articles (a, an)
    and Past Tense (VBD) per word in sentence.
    """
    words = nltk.word_tokenize(s)
    num_words = len(words)
    postags = [x[1] for x in nltk.pos_tag(words)]
    num_prps = len([x for x in postags if x == "PRP"]) / num_words
    num_indef_articles = len([x for x in words 
            if x.lower() == "a" or x.lower() == "an"]) / num_words
    num_vbds = len([x for x in postags if x == "VBD"]) / num_words
    return num_prps, num_indef_articles, num_vbds

def pad_words(words):
    words.insert(0, "_START_")        
    words.append("_END_")
    return words
    
def browncorpus_gram_freqs(gram_size):
    """ 
    Construct a frequency distribution out of news sentences in Brown
    Corpus available in NLTK
    """
    gram_freqs = nltk.FreqDist()
    num_elems = 0
    for sent in brown.sents(categories=["news"]):
        sent = [x.lower() for x in sent]
        sent = pad_words(sent)
        # construct n-grams
        for ngram in nltk.ngrams(sent, 2):
            num_elems += 1
            gram_freqs[ngram] += 1        
    return gram_freqs, num_elems
    
def distinctiveness(s, gram_size, gram_freqs, num_grams):
    """
    Returns probability of the sentence being from Brown Corpus (news)
    based on the probabilities of the n-gram in the sentence being found
    in the Brown corpus.
    """
    words = nltk.word_tokenize(s)
    words = pad_words(words)
    log_prob = 0.0
    for wgram in nltk.ngrams(words, 2):
        p_wgram = (gram_freqs[wgram] + 1) / (num_grams + len(gram_freqs)) 
        log_prob += math.log(p_wgram)
    return log_prob
    
def vw_format(label, text, gram1_freqs, num_grams1, gram2_freqs,
              num_grams2, gram3_freqs, num_grams3):
    vw_label = "1" if int(label) == 1 else "-1"
    prp, ia, pt = generality(text)
    d1 = distinctiveness(text, 1, gram1_freqs, num_grams1)
    d2 = distinctiveness(text, 2, gram2_freqs, num_grams2)
    d3 = distinctiveness(text, 3, gram3_freqs, num_grams3)
    return "%s |s %s|g prp:%.3f ia:%.3f pt:%.3f|d d1:%.3f d2:%.3f d3:%.3f\n" % \
        (vw_label, text, prp, ia, pt, d1, d2, d3)

def main():
    # load up frequency distributions for calculating distinctiveness
    gram1_freqs, num_grams1 = browncorpus_gram_freqs(1)
    gram2_freqs, num_grams2 = browncorpus_gram_freqs(2)
    gram3_freqs, num_grams3 = browncorpus_gram_freqs(3)
    
    fin = open(INPUT_FILE, 'rb')
    ftrain = open(VW_TRAIN_FILE, 'wb')
    ftest = open(VW_TEST_FILE, 'wb')
    for line in fin:
        label, text = line.strip().split("\t")
        # VW can't handle ":" since its a special character
        text = text.replace(":", " ")
        dice = random.random()
        if dice < TRAIN_TEST_SPLIT:
            ftrain.write(vw_format(label, text, gram1_freqs, num_grams1,
                                   gram2_freqs, num_grams2, gram3_freqs,
                                   num_grams3))
        else:
            ftest.write(vw_format(label, text, gram1_freqs, num_grams1,
                                  gram2_freqs, num_grams2, gram3_freqs,
                                  num_grams3))
    ftest.flush()
    ftrain.flush()
    ftest.close()
    ftrain.close()
    fin.close()
    
if __name__ == "__main__":
    main()

This produces a VW datafile that looks like this:

1
 2
 3
 4
 5
 6
 7
 8
 9
10
1 |s My mother would kill me if she knew I took the Rover!|g prp:0.231 ia:0.000 pt:0.154|d d1:-162.664 d2:-162.664 d3:-162.664
-1 |s I think it was something else... Suzie says the bust was bullshit...|g prp:0.143 ia:0.000 pt:0.143|d d1:-172.825 d2:-172.825 d3:-172.825
1 |s You've been a bad girl, Grace!|g prp:0.111 ia:0.111 pt:0.000|d d1:-114.040 d2:-114.040 d3:-114.040
1 |s Your lies are old but you tell them pretty good.|g prp:0.182 ia:0.000 pt:0.000|d d1:-132.371 d2:-132.371 d3:-132.371
1 |s And a waitress named Flo.|g prp:0.000 ia:0.167 pt:0.000|d d1:-75.848 d2:-75.848 d3:-75.848
1 |s Is everyone in this town on drugs?|g prp:0.000 ia:0.000 pt:0.000|d d1:-100.962 d2:-100.962 d3:-100.962
-1 |s Well, he offered, and I just couldn't refuse.|g prp:0.167 ia:0.000 pt:0.083|d d1:-137.264 d2:-137.264 d3:-137.264
1 |s Don't admire people too much, they might disappoint you.|g prp:0.167 ia:0.000 pt:0.000|d d1:-140.809 d2:-140.809 d3:-140.809
-1 |s I mean there's someone besides your mother you've got to forgive.|g prp:0.143 ia:0.000 pt:0.000|d d1:-169.716 d2:-169.716 d3:-169.716
...

With the addition of the generality features (namespace g above), the accuracy on the test set jumped to 61.864% and the addition of the distinctiveness features (namespace d above) it jumped to 62.296%. In comparison, the result reported in the paper after the addition of all 3 feature sets was 64.46%. I decided to stop at this point, since my objective was to learn how to do classification with VW and I had already explored that.

VW offers some interesting new features and algorithms, some of which I don't fully understand yet, but it looks like it may be a useful tool to have in one's Machine Learning toolbox. All the code in this post can be found on GitHub here.