Saturday, May 25, 2013

Feature Selection with Scikit-Learn


I am currently doing the Web Intelligence and Big Data course from Coursera, and one of the assignments was to predict a person's ethnicity from a set of about 200,000 genetic markers (provided as boolean values). As you can see, a simple classification problem.

One of the optimization suggestions for the exercise was to prune the feature set. Prior to this, I had only a vague notion that one could do this by running correlations of each feature against the outcome and choosing the most highly correlated ones. This seemed like a good opportunity to learn a bit about feature selection, so I did some reading and digging within Scikit-Learn to find out whether it had something for this (it does). I also decided to investigate how the accuracy of a classifier varies with the feature size. This post is the result of that effort.

The IR Book has a sub-chapter on Feature Selection. Three main approaches to feature selection are covered: mutual information based, chi-square based and frequency based. Scikit-Learn provides several methods to select features based on chi-squared and ANOVA F-values for classification. I learned about this from Matt Spitz's passing reference to chi-squared feature selection in Scikit-Learn in his Slugger ML talk at PyCon USA 2012.
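
For reference, the selection itself is just a SelectKBest transformer parameterized by a score function and the number of features to keep. Here is a minimal sketch; the boolean feature matrix and labels below are random placeholders standing in for the genetic marker data, not the real thing:

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, f_classif

# toy stand-in for the marker data: boolean features, 5 ethnicity classes
X = np.random.randint(0, 2, size=(100, 1000))
y = np.random.randint(0, 5, size=(100,))

selector = SelectKBest(score_func=chi2, k=50)   # or score_func=f_classif
Xk = selector.fit_transform(X, y)
print Xk.shape   # (100, 50)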

In the code below, I compute the accuracies at various feature sizes for 9 different classifiers, using both the chi-squared and the ANOVA F-value score functions.

from __future__ import division

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cross_validation import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

N_TRAIN_ROWS = 139
N_FEATURES = 204355
N_FOLDS = 10

ethCode = {
  "CEU": 0, "GIH": 1, "JPT": 2, "ASW": 3, "YRI": 4
}

def load(filename, numInstances, numFeatures):
  headerLines = 3
  ftrain = open(filename, 'rb')
  X = np.zeros((numInstances, numFeatures))
  y = np.zeros((numInstances,))
  i = 0
  for line in ftrain:
    i += 1
    if i <= headerLines:
      continue
    line = line.strip()
    cols = line.split("\t")
    y[i - headerLines - 1] = ethCode[cols[0]]
    for j in range(1, len(cols) - 1):
      if cols[j] == "1":
        X[i - headerLines - 1, j] = 1
  ftrain.close()
  return X, y

def evaluate(X, y, nfolds, clf, nfeats, clfname, scoreFunc):
  kfold = KFold(X.shape[0], n_folds=nfolds)
  acc = 0
  i = 0
  print("%s (#-features=%d)..." % (clfname, nfeats))
  for train, test in kfold:
    i += 1
    Xtrain, Xtest, ytrain, ytest = X[test], X[train], y[test], y[train]
    clf.fit(Xtrain, ytrain)
    ypred = clf.predict(Xtest)
    score = accuracy_score(ytest, ypred)
    print "  Fold #%d, accuracy=%f" % (i, score)
    acc += score
  acc /= nfolds
  print "## %s (#-features=%d) accuracy=%f" % (clfname, nfeats, acc)
  return acc

def plot(accuracies, xvals, legends):
  fig = plt.figure()
  ax = fig.add_subplot(111)
  cm = [color + marker
    for color in ["b", "g", "r", "c", "m", "y", "b"]
    for marker in ["o", "D"]]
  for i in range(0, accuracies.shape[0]):
    ax.plot(xvals, accuracies[i, :], color=cm[i][0], 
      marker=cm[i][1], label=legends[i])
  plt.xlabel("#-Features")
  plt.ylabel("Accuracy")
  plt.title("Accuracy vs #-Features for different classifiers")
  ax.set_xscale("log")
  box = ax.get_position()
  ax.set_position([box.x0, box.y0 + box.height * 0.3,
    box.width, box.height * 0.7])
  ax.legend(loc="upper center", bbox_to_anchor=(0.5, -0.15),
    fancybox=True, shadow=True, ncol=3)
  plt.show()
  
def main():
  X, y = load("genestrain.tab", N_TRAIN_ROWS, N_FEATURES)
  nFeatures = np.array([N_FEATURES, 50000, 5000, 500, 50, 10])
  clfs = [
    BernoulliNB(),
    MultinomialNB(),
    GaussianNB(),
    DecisionTreeClassifier(),
    RandomForestClassifier(n_estimators=10),
    OneVsRestClassifier(LinearSVC(random_state=0)),
    OneVsRestClassifier(LogisticRegression()),
    OneVsRestClassifier(SGDClassifier()),
    OneVsRestClassifier(RidgeClassifier()),
  ]
  clfnames = map(lambda x: type(x).__name__
    if type(x).__name__ != 'OneVsRestClassifier'
    else type(x.estimator).__name__, clfs)
  scoreFuncs = [chi2, f_classif]
  accuracies = np.zeros((len(clfs), len(nFeatures), len(scoreFuncs)))
  for k in range(0, len(scoreFuncs)):
    Xtrunc = X.copy()
    for j in range(0, len(nFeatures)):
      if nFeatures[j] != N_FEATURES:
        featureSelector = SelectKBest(score_func=scoreFuncs[k], k=nFeatures[j])
        Xtrunc = featureSelector.fit_transform(X, y)
      for i in range(0, len(clfs)):
        accuracies[i, j, k] = evaluate(Xtrunc, y, N_FOLDS, clfs[i],
          nFeatures[j], clfnames[i], scoreFuncs[k])
  # print out accuracy matrix
  for k in range(0, len(scoreFuncs)):
    for i in range(0, len(clfs)):
      print "%22s " % clfnames[i],
      for j in range(0, accuracies.shape[1]):
        print "%5.3f" % accuracies[i, j, k],
      print
    plot(accuracies[:, :, k], nFeatures, clfnames)

if __name__ == "__main__":
  main()

The results and graph of accuracies for different feature sizes across the different classifiers, using the chi-squared measure, are shown below:

                         200k   50k    5k   500    50    10
-----------------------------------------------------------
           BernoulliNB  0.332 0.386 0.454 0.525 0.585 0.233
         MultinomialNB  0.383 0.488 0.505 0.519 0.489 0.158
            GaussianNB  0.294 0.427 0.537 0.568 0.585 0.233
DecisionTreeClassifier  0.566 0.571 0.551 0.576 0.582 0.233
RandomForestClassifier  0.380 0.481 0.544 0.534 0.564 0.233
             LinearSVC  0.406 0.554 0.570 0.585 0.585 0.233
    LogisticRegression  0.391 0.512 0.582 0.585 0.585 0.233
         SGDClassifier  0.401 0.557 0.559 0.585 0.585 0.239
       RidgeClassifier  0.403 0.556 0.568 0.568 0.585 0.233



And the same results and graph, but using the ANOVA F measure for feature selection, are shown below:

                         200k   50k    5k   500    50    10
-----------------------------------------------------------
           BernoulliNB  0.332 0.406 0.556 0.544 0.544 0.501
         MultinomialNB  0.383 0.452 0.585 0.585 0.544 0.455
            GaussianNB  0.294 0.436 0.585 0.585 0.585 0.585
DecisionTreeClassifier  0.566 0.566 0.577 0.585 0.585 0.585
RandomForestClassifier  0.363 0.474 0.556 0.563 0.563 0.585
             LinearSVC  0.406 0.559 0.585 0.585 0.585 0.585
    LogisticRegression  0.391 0.527 0.585 0.585 0.585 0.479
         SGDClassifier  0.401 0.475 0.583 0.585 0.563 0.585
       RidgeClassifier  0.403 0.559 0.585 0.585 0.585 0.585



As you can see (reading the graph right to left), accuracy tends to increase as features are pruned away, up to a point beyond which there are too few features left for the classifier to draw any reliable conclusions.


Saturday, May 11, 2013

Finding Significant Phrases in Tweets with NLTK


Earlier this week, there was a question about finding significant phrases in text on the Natural Language Processing People (login required) group on LinkedIn. I suggested looking at this LingPipe tutorial. The idea is to find statistically significant word collocations, i.e., those that occur more frequently than can be explained away by chance. I first became aware of this approach from the LLG Book, where two approaches are described: one based on Log-Likelihood Ratios (LLR) and one based on the Chi-Squared test of independence; the latter is used by LingPipe.

I had originally set out to actually provide an implementation for my suggestion (to answer a followup question). However, the Scipy pydoc notes that the chi-squared test may be invalid when the observed or expected frequencies in each category are too small. Our algorithm compares just two observed and expected frequencies, so it probably qualifies. Hence I went with the LLR approach, even though it is slightly more involved.

The idea is to find, for each bigram, the likelihood that its components are dependent on each other versus the likelihood that they are not. For each bigram with a positive LLR, we repeat the analysis by adding its neighboring word, arriving at a list of trigrams with positive LLR, and so on, until we reach the n-gram level we think makes sense for the corpus. You can find an explanation of the math in one of my earlier posts, but you will probably find a better explanation in the LLG book.
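
As an aside, NLTK itself ships with a collocation finder that scores bigrams by the log-likelihood ratio, so if all you need are ranked bigrams (rather than the multi-pass extension to longer n-grams described below), a minimal sketch like the following may be enough; the toy sentence here is just a placeholder for your tokenized corpus:

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# placeholder text; in practice this would be the tokenized tweet corpus
text = "new york is big . i love new york . new york never sleeps ."
words = nltk.word_tokenize(text.lower())

finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(2)   # ignore bigrams seen only once
# top 5 bigrams ranked by log-likelihood ratio
print finder.nbest(BigramAssocMeasures.likelihood_ratio, 5)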

For input data, I decided to use Twitter. I'm not that familiar with the Twitter API, but I'm taking the Introduction to Data Science course on Coursera, and the first assignment provided some code to pull data from the Twitter 1% feed, so I just reused that. I preprocess the feed so I am left with about 65k English tweets using the following code:

from __future__ import division

import json
import re
import sys

def main():
  if len(sys.argv) != 2:
    print "Usage: %s /path/to/twitter/json/list.txt"
    sys.exit(-1)
  fin = open(sys.argv[1], 'rb')
  fout = open("twitter_messages.txt", 'wb')
  for line in fin:
    try:
      data = json.loads(line.strip())
      lang = data["lang"]
      if lang == "en":
        tweet = data["text"]
        tweet = re.sub(r"\s+", " ", tweet.replace("\n", " "))
        tweet = tweet.encode("ascii", "ignore")
        if len(tweet) == 0:
          continue
        fout.write("%s\n" % (tweet))
    except KeyError:
      continue
  fin.close()
  fout.close()

if __name__ == "__main__":
  main()

The main code for extracting interesting phrases implements the math described above. The main difference is that I extend the approach to n-grams higher than 2. For trigrams, for example, I consider only those trigrams whose first two words form a bigram with LLR > 0, and so on. The code considers n-grams of up to 4 words (increasing this is simply a matter of changing the upper limit of the loop in the code). As it turns out, the largest significant n-gram in my data is a 4-gram.

Another important difference is that I have not used any heuristic to prune the n-gram set, other than LLR > 0. The LLG book prescribes additional heuristics for this, such as keeping n-grams of nouns, verbs, adjectives or prepositions only, excluding all phrases that don't end with nouns, excluding phrases consisting entirely of stopwords, etc. I wasn't sure how effective they would be with tweets, so I skipped them.
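
For completeness, here is a rough sketch of what such a filter might look like, using NLTK's POS tagger and English stopword list; the choice of tag prefixes for "content" words is my own assumption, not something prescribed by the LLG book:

import nltk
from nltk.corpus import stopwords

ENGLISH_STOPWORDS = set(stopwords.words("english"))
# nouns, verbs, adjectives and prepositions (Penn Treebank tag prefixes)
CONTENT_TAG_PREFIXES = ("NN", "VB", "JJ", "IN")

def keep_ngram(ngram):
  # exclude phrases consisting entirely of stopwords
  if all(w in ENGLISH_STOPWORDS for w in ngram):
    return False
  tagged = nltk.pos_tag(list(ngram))
  # keep only phrases made up of the allowed word classes
  if not all(tag.startswith(CONTENT_TAG_PREFIXES) for (w, tag) in tagged):
    return False
  # exclude phrases that don't end with a noun
  return tagged[-1][1].startswith("NN")

print keep_ngram(("solar", "eclipse"))   # expect True
print keep_ngram(("of", "the"))          # expect False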

So anyway, here's the code. As always, all the code described in this post can be found on my GitHub page for this subproject. The data is not included because I am not sure if there are legal implications to doing so, and anyway, it's simple and painless enough to grab some data off Twitter and build up your own dataset.

from __future__ import division
import operator
import nltk
import numpy as np
from scipy.stats import binom
import string

def isValid(word):
  if word.startswith("#"):
    return False # no hashtag
  else:
    vword = word.translate(string.maketrans("", ""), string.punctuation)
    return len(vword) == len(word)

def llr(c1, c2, c12, n):
  # H0: Independence p(w1,w2) = p(w1,~w2) = c2/N
  p0 = c2 / n
  # H1: Dependence, p(w1,w2) = c12/N
  p10 = c12 / n
  # H1: p(~w1,w2) = (c2-c12)/N
  p11 = (c2 - c12) / n
  # binomial probabilities
  # H0: b(c12; c1, p0),  b(c2-c12; N-c1, p0)
  # H1: b(c12, c1, p10), b(c2-c12; N-c1, p11)
  probs = np.matrix([
    [binom(c1, p0).logpmf(c12), binom(n - c1, p0).logpmf(c2 - c12)],
    [binom(c1, p10).logpmf(c12), binom(n - c1, p11).logpmf(c2 - c12)]])
  # log LLR = log p(H1) - log p(H0)
  return np.sum(probs[1, :]) - np.sum(probs[0, :])

def isLikelyNGram(ngram, phrases):
  if len(ngram) == 2:
    return True
  prevGram = ngram[:-1]
  return prevGram in phrases

def main():
  # accumulate words and word frequency distributions
  lines = []
  unigramFD = nltk.FreqDist()
  fin = open("twitter_messages.txt", 'rb')
  i = 0
  for line in fin:
    i += 1
    words = nltk.word_tokenize(line.strip().lower())
    words = filter(lambda x: isValid(x), words)
    [unigramFD.inc(x) for x in words]
    lines.append(words)
    if i > 1000:
      break
  fin.close()
  # identify likely phrases using a multi-pass algorithm based
  # on the LLR approach described in the Building Search Applications
  # Lucene, LingPipe and GATE book, except that we treat n-gram
  # collocations beyond 2 as n-1 gram plus a unigram.
  phrases = nltk.defaultdict(float)
  prevGramFD = None
  for i in range(2, 5):
    ngramFD = nltk.FreqDist()
    for words in lines:
      nextGrams = nltk.ngrams(words, i)
      nextGrams = filter(lambda x: isLikelyNGram(x, phrases), nextGrams)
      [ngramFD.inc(x) for x in nextGrams]
    for k, v in ngramFD.iteritems():
      if v > 1:
        c1 = unigramFD[k[0]] if prevGramFD is None else prevGramFD[k[:-1]]
        c2 = unigramFD[k[1]] if prevGramFD is None else unigramFD[k[-1]]
        c12 = ngramFD[k]
        n = unigramFD.N() if prevGramFD is None else prevGramFD.N()
        phrases[k] = llr(c1, c2, c12, n)
    # only consider bigrams where LLR > 0, ie P(H1) > P(H0)
    likelyPhrases = nltk.defaultdict(float)
    likelyPhrases.update([(k, v) for (k, v)
      in phrases.iteritems() if len(k) == i and v > 0])
    print "==== #-grams = %d ====" % (i)
    sortedPhrases = sorted(likelyPhrases.items(),
      key=operator.itemgetter(1), reverse=True)
    for k, v in sortedPhrases:
      print k, v
    prevGramFD = ngramFD

if __name__ == "__main__":
  main()

Here are the results of my run. As you can see, some of these, such as "solar eclipse", "the fact", etc, seem quite reasonable, and things like "one new unfollower" appear to be popular Twitterisms.

sujit@cyclone:phrases$ python interesting_phrases.py 
==== #-grams = 2 ====
('today', 'stats') 4.57956434533
('http', 'ipad') 1.45257884653
('the', 'same') 1.43108593163
('i', 'thought') 1.37514113846
('new', 'followers') 0.876128729563
('ea', 'ea') 0.749974634332
('my', 'favorite') 0.723215257916
('a', 'lot') 0.716683441626
('the', 'fact') 0.705937550309
('i', 'said') 0.668642986328
('my', 'head') 0.153309860951
('i', 'guess') 0.0987318394656
('solar', 'eclipse') 0.0902467987635
('fanboys', 'de') 0.0902467987635
('anybody', 'else') 0.0901414524373
('treat', 'yourself') 0.089930759785
('last', 'word') 0.089193335502
('who', 'wants') 0.0874024479576
('just', 'found') 0.084031365521
('this', 'whole') 0.0831885949118
('this', 'suck') 0.0831885949118
('on', 'saturday') 0.0827672096072
('my', 'daily') 0.0766571226909
('my', 'mom') 0.0766571226909
('my', 'dog') 0.0766571226909
('a', 'month') 0.0733913865804
('a', 'question') 0.0733913865804
('to', 'pass') 0.0700203041438
('the', 'son') 0.068018723947
('the', 'dark') 0.068018723947
('the', 'past') 0.068018723947
==== #-grams = 3 ====
('one', 'new', 'unfollower') 2.29374909449
('http', 'ipad', 'ipadgames') 1.49704633659
('2', 'new', 'unfollowers') 0.749659068453
('very', 'last', 'word') 0.0902221271564
('i', 'just', 'found') 0.0878684937321
==== #-grams = 4 ====
('and', 'one', 'new', 'unfollower') 0.176934229693

This is all I have for this week. I started off implementing this in order to remind myself of the mechanics of the approach, but it turned out to be quite a lot of fun. Hope you enjoyed it too.

Friday, May 03, 2013

Inter-Document Similarity with Scikit-Learn and NLTK


Someone recently asked me about using Python to calculate document similarity across text documents. The application had to do with cheating detection, i.e., comparing student transcripts and flagging documents with (abnormally) high similarity for further investigation. For security reasons, I could not get access to actual student transcripts. But the basic idea was to convince ourselves that this approach is valid, and to come up with a code template for doing this.

I have been playing quite a bit with NLTK lately, but for this work, I decided to use the Python ML Toolkit Scikit-Learn, which has pretty powerful text processing facilities. I did end up using NLTK for its cosine similarity function, but that was about it.

I decided to use the coffee-sugar-cocoa mini-corpus of 53 documents to test out the code - I first found this in Dr Manu Konchady's TextMine project, and I have used it off and on. For convenience I have made it available at the github location for the sub-project.

Here's the code. It first parses this data into a temporary one-line-per-document version, then vectorizes each line and creates a term-document matrix for the corpus.

from __future__ import division

from operator import itemgetter

from nltk.cluster.util import cosine_distance
import numpy as np
import random
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

def preprocess(fnin, fnout):
  fin = open(fnin, 'rb')
  fout = open(fnout, 'wb')
  buf = []
  id = ""
  category = ""
  for line in fin:
    line = line.strip()
    if line.find("-- Document Separator --") > -1:
      if len(buf) > 0:
        # write out body,
        body = re.sub("\s+", " ", " ".join(buf))
        fout.write("%s\t%s\t%s\n" % (id, category, body))
      # process next header and init buf
      id, category, rest = map(lambda x: x.strip(), line.split(": "))
      buf = []
    else:
      # process body
      buf.append(line)
  fin.close()
  fout.close()

def train(fnin):
  docs = []
  cats = []
  fin = open(fnin, 'rb')
  for line in fin:
    id, category, body = line.strip().split("\t")
    docs.append(body)
    cats.append(category)
  fin.close()
  pipeline = Pipeline([
    ("vect", CountVectorizer(min_df=0, stop_words="english")),
    ("tfidf", TfidfTransformer(use_idf=False))])
  tdMatrix = pipeline.fit_transform(docs, cats)
  return tdMatrix, cats

def test(tdMatrix, cats):
  testIds = random.sample(range(0, len(cats)), int(0.1 * len(cats)))
  testIdSet = set(testIds)
  refIds = filter(lambda x: x not in testIdSet, range(0, len(cats)))
  sims = np.zeros((len(testIds), len(refIds)))
  for i in range(0, len(testIds)):
    for j in range(0, len(refIds)):
      doc1 = np.asarray(tdMatrix[testIds[i], :].todense()).reshape(-1)
      doc2 = np.asarray(tdMatrix[refIds[j], :].todense()).reshape(-1)
      sims[i, j] = cosine_distance(doc1, doc2)
  for i in range(0, sims.shape[0]):
    xsim = list(enumerate(sims[i, :]))
    sortedSims = sorted(xsim, key=itemgetter(1), reverse=True)[0:5]
    sourceCat = cats[testIds[i]]
    numMatchedCats = 0
    numTestedCats = 0
    for j, score in sortedSims:
      targetCat = cats[j]
      if sourceCat == targetCat:
        numMatchedCats += 1
      numTestedCats += 1
    print("Test Doc: %d, Source Category: %s, Target Matched: %d/%d times" %
      (i, sourceCat, numMatchedCats, numTestedCats))
      
def main():
  preprocess("sugar-coffee-cocoa-docs.txt", "sccpp.txt")
  tdMatrix, cats = train("sccpp.txt")
  test(tdMatrix, cats)
  
if __name__ == "__main__":
  main()

The code then pulls out five random documents and, for each one, finds the five most similar documents and counts how many of those fall in the same category as the source document. As you can see, the results don't look too bad.

sujit@cyclone:docsim$ python docsim.py 
Test Doc: 0, Source Category: coffee, Target Matched: 3/5 times
Test Doc: 1, Source Category: cocoa, Target Matched: 2/5 times
Test Doc: 2, Source Category: sugar, Target Matched: 3/5 times
Test Doc: 3, Source Category: cocoa, Target Matched: 1/5 times
Test Doc: 4, Source Category: sugar, Target Matched: 2/5 times

A few things I tried were turning IDF on (it's currently off) and using Euclidean distance instead of cosine distance. The first actually made the results slightly worse over a set of runs, and the second did not seem to have any effect.
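
For reference, both variations are one-line changes against the code above; here is a sketch (the Euclidean version would also need the sort order flipped, since smaller distances mean more similar documents):

from scipy.spatial.distance import euclidean
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# variation 1: turn IDF weighting on in the pipeline built in train()
pipeline = Pipeline([
  ("vect", CountVectorizer(min_df=0, stop_words="english")),
  ("tfidf", TfidfTransformer(use_idf=True))])

# variation 2: in test(), use Euclidean distance instead of cosine,
# and sort ascending (smaller distance = more similar), i.e.:
#   sims[i, j] = euclidean(doc1, doc2)
#   sortedSims = sorted(xsim, key=itemgetter(1))[0:5]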

One thing to note is that this approach is unlikely to scale. Computing document-to-document similarity across the entire corpus (as you would need to do for a group of students) is an O(n²) operation. I have recently been reading up on Locality Sensitive Hashing, which can mitigate this problem by detecting clumps of similar documents. Another approach may be to cluster the documents first and only compute inter-document similarities within each cluster.
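
A minimal sketch of the clustering idea, using scikit-learn's KMeans on the term-document matrix produced by train() above (the number of clusters here is an arbitrary guess, not something I have tuned):

from sklearn.cluster import KMeans

def cluster_docs(tdMatrix, numClusters=5):
  # assign each document to a cluster; afterwards, only documents that
  # share a cluster need to be compared pairwise
  km = KMeans(n_clusters=numClusters, random_state=42)
  km.fit(tdMatrix)
  clusters = {}
  for docId, label in enumerate(km.labels_):
    clusters.setdefault(label, []).append(docId)
  return clusters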

And I guess that's it for today. The approach I took with Scikit-Learn is heavily inspired by the code in the text processing tutorial linked to above. For a slightly more math- and code-heavy treatment of text processing with Scikit-Learn, check out Christian Perone's awesome blog posts here and here.

Update (2013-05-08) - I've been reading Building Search Applications: Lucene, LingPipe and GATE by Dr Manu Konchady (henceforth referred to as the LLG book in this blog), and I came upon the SCAM (Stanford Copy Analysis Mechanism) algorithm for plagiarism detection, which seemed better suited to my cheating-detection use case than cosine similarity. Here is some Python code that implements the SCAM measure between two documents. This code, and the update to the original code to use it, can be found on my GitHub page for this subproject.

from __future__ import division
import numpy as np
import scipy.sparse as ss

def _s_pos_or_zero(x):
  return x if x > 0 else 0

def _s_zero_mask(x, y):
  return 0 if y == 0 else x

def _s_safe_divide(x, y):
  return 0 if x == 0 or y == 0 else x / y

_v_pos_or_zero = np.vectorize(_s_pos_or_zero)
_v_zero_mask = np.vectorize(_s_zero_mask)
_v_safe_divide = np.vectorize(_s_safe_divide)

def _assymetric_subset_measure(doc1, doc2):
  epsilon = np.ones(doc1.shape) * 2
  filtered = _v_pos_or_zero(epsilon - (_v_safe_divide(doc1, doc2) +
    _v_safe_divide(doc2, doc1)))
  zdoc1 = _v_zero_mask(doc1, filtered)
  zdoc2 = _v_zero_mask(doc2, filtered)
  return np.sum(np.dot(zdoc1, zdoc2)) / np.sum(np.dot(doc1, doc2))

def scam_distance(doc1, doc2):
  asm12 = _assymetric_subset_measure(doc1, doc2)
  asm21 = _assymetric_subset_measure(doc2, doc1)
  return max(asm12, asm21)

I decided to check this out by calculating the SCAM distance (which should really be called SCAM similarity, since higher values indicate closeness) between my previous post and this post without the update, and between this post without the update and this post with the update. My expectation was that the score for the first pair should be lower than that for the second pair, and indeed it is, giving 0.655172413793 and 0.954821894005 respectively.
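
In case you want to reproduce that kind of comparison, the inputs to scam_distance() are just term frequency vectors over a shared vocabulary; a minimal sketch using CountVectorizer (the file names are placeholders for whatever pair of documents you want to compare):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# file names are placeholders; scam_distance() is the function defined above
texts = [open("doc1.txt").read(), open("doc2.txt").read()]
tdMatrix = CountVectorizer(stop_words="english").fit_transform(texts)
doc1 = np.asarray(tdMatrix[0, :].todense()).reshape(-1)
doc2 = np.asarray(tdMatrix[1, :].todense()).reshape(-1)
print scam_distance(doc1, doc2)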