Saturday, September 27, 2008

IR Math with Java : Similarity Measures

Last week, I wrote about building term document matrices based on Dr. Manu Konchady's Text Mining Application Programming book. This week, I continue working on computing some similarity metrics from the same set of data. Most of the material is covered in Chapter 8 of the book.

As before, our document collection consists of 7 document titles as shown below:

D1      Human machine interface for computer applications
D2      A survey of user opinion of computer system response time
D3      The EPS user interface management system
D4      System and human system engineering testing of EPS
D5      The generation of random, binary and ordered trees
D6      The intersection graph of paths in trees
D7      Graph minors: A survey

The document collection is first converted to a raw term frequency matrix. This is further normalized using either TF/IDF or LSI. The matrix thus normalized is the input to our similarity computations.

A document is represented by a column of data in our (normalized) term document matrix. To compute the similarity between two documents, we perform computations between two columns of the matrix. Conceptually, each document is a point in an n-dimensional term space, where each term corresponds to a dimension. In our example we have 23 terms, so our term space is 23-dimensional. In case this is hard to visualize (it was for me), it helps to think of documents with only 2 terms (and hence a two-dimensional term space) and extrapolate from there.

Jaccard Similarity

Jaccard Similarity is a simple and very intuitive measure of document to document similarity. It is defined as follows:

  similarity(A,B) = n(A ∩ B) / n(A ∪ B)

Using the Inclusion-Exclusion Principle, this reduces to:

  similarity(A,B) = n(A ∩ B) / (n(A) + n(B) - n(A ∩ B))
  where:
    n(A ∩ B) = Σ min(Ai, Bi)
    n(A) = Σ Ai
    n(B) = Σ Bi
  for i = [0..n-1], where n = number of terms in our term-document matrix.
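
As a quick sanity check on the formula, here is a minimal standalone sketch (plain double arrays and made-up term counts, rather than the Jama matrices used below) that computes the Jaccard similarity of two toy term vectors:

// Hypothetical standalone illustration of the Jaccard formula above
public class JaccardByHand {
  public static void main(String[] args) {
    // made-up term frequencies for two documents over a 4-term vocabulary
    double[] a = {1.0, 2.0, 0.0, 1.0};
    double[] b = {0.0, 1.0, 3.0, 1.0};
    double intersection = 0.0D, sumA = 0.0D, sumB = 0.0D;
    for (int i = 0; i < a.length; i++) {
      intersection += Math.min(a[i], b[i]);   // n(A ∩ B)
      sumA += a[i];                           // n(A)
      sumB += b[i];                           // n(B)
    }
    double similarity = intersection / (sumA + sumB - intersection);
    System.out.println(similarity);           // 2 / (4 + 5 - 2) = 0.2857...
  }
}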

AbstractSimilarity.java

The AbstractSimilarity class shown below contains the code to compare all document pairs from our collection. This has a complexity of O(n²), which we could roughly halve by coding in the knowledge that similarity(A,B) == similarity(B,A) and similarity(A,A) == 1.0, but that is still too high for a large number of documents. I haven't added this optimization here, though, since this is just toy data and the performance is tolerable.
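
If you did want to exploit that symmetry, the inner loop could start at i and the result mirrored across the diagonal. A hedged sketch of such a variant of transform() (using the same Jama calls as the class below, and not part of the actual code) might look like this:

// Hypothetical variant of transform() that fills only the upper triangle
// and mirrors it, relying on similarity(A,B) == similarity(B,A) and
// similarity(A,A) == 1.0.
public Matrix transform(Matrix termDocumentMatrix) {
  int numDocs = termDocumentMatrix.getColumnDimension();
  int numTerms = termDocumentMatrix.getRowDimension();
  Matrix similarityMatrix = new Matrix(numDocs, numDocs);
  for (int i = 0; i < numDocs; i++) {
    similarityMatrix.set(i, i, 1.0D);
    Matrix sourceDoc = termDocumentMatrix.getMatrix(0, numTerms - 1, i, i);
    for (int j = i + 1; j < numDocs; j++) {
      Matrix targetDoc = termDocumentMatrix.getMatrix(0, numTerms - 1, j, j);
      double sim = computeSimilarity(sourceDoc, targetDoc);
      similarityMatrix.set(i, j, sim);
      similarityMatrix.set(j, i, sim);
    }
  }
  return similarityMatrix;
}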

// Source: src/main/java/com/mycompany/myapp/similarity/AbstractSimilarity.java
package com.mycompany.myapp.similarity;

import org.apache.commons.collections15.Transformer;

import Jama.Matrix;

public abstract class AbstractSimilarity 
    implements Transformer<Matrix, Matrix> {

  public Matrix transform(Matrix termDocumentMatrix) {
    int numDocs = termDocumentMatrix.getColumnDimension();
    Matrix similarityMatrix = new Matrix(numDocs, numDocs);
    for (int i = 0; i < numDocs; i++) {
      Matrix sourceDocMatrix = termDocumentMatrix.getMatrix(
        0, termDocumentMatrix.getRowDimension() - 1, i, i); 
      for (int j = 0; j < numDocs; j++) {
        Matrix targetDocMatrix = termDocumentMatrix.getMatrix(
          0, termDocumentMatrix.getRowDimension() - 1, j, j);
        similarityMatrix.set(i, j, 
          computeSimilarity(sourceDocMatrix, targetDocMatrix));
      }
    }
    return similarityMatrix;
  }

  protected abstract double computeSimilarity(
      Matrix sourceDoc, Matrix targetDoc);
}

JaccardSimilarity.java

For the JaccardSimilarity class, we define the abstract method computeSimilarity() as shown below (based on our formula above). I have cheated slightly here: in Jama, norm1() is defined as the maximum column sum, but since our document matrices are column matrices (i.e. they have a single column of non-negative values), norm1() returns the same value as the sum of all elements in the document matrix. This makes the code less readable, in the sense that the reader/maintainer has to jump through an extra mental hoop, but it keeps the code concise.

// Source: src/main/java/com/mycompany/myapp/similarity/JaccardSimilarity.java
package com.mycompany.myapp.similarity;

import Jama.Matrix;

public class JaccardSimilarity extends AbstractSimilarity {

  @Override
  protected double computeSimilarity(Matrix source, Matrix target) {
    double intersection = 0.0D;
    for (int i = 0; i < source.getRowDimension();i++) {
      intersection += Math.min(source.get(i, 0), target.get(i, 0));
    }
    if (intersection > 0.0D) {
      double union = source.norm1() + target.norm1() - intersection;
      return intersection / union;
    } else {
      return 0.0D;
    }
  }
}

The results of our Jaccard similarity computation are shown below. The JUnit test that produces this output is SimilarityTest.java, shown later in this post. The two matrices are derived from our raw term document matrix normalized using TF/IDF and LSI respectively. As you can see, the similarity matrix created off the LSI normalized vector shows more inter-document similarity than the one created off the TF/IDF normalized vector.

=== Jaccard Similarity (TF/IDF) ===
            D1      D2      D3      D4      D5      D6      D7
    D1  1.0000  0.0000  0.0000  0.0818  0.0000  0.0000  0.0000
    D2  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0710
    D3  0.0000  0.0000  1.0000  0.2254  0.0000  0.0000  0.0000
    D4  0.0818  0.0000  0.2254  1.0000  0.0000  0.0000  0.0000
    D5  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000
    D6  0.0000  0.0000  0.0000  0.0000  0.0000  1.0000  0.1781
    D7  0.0000  0.0710  0.0000  0.0000  0.0000  0.1781  1.0000

=== Jaccard Similarity (LSI) ===
            D1      D2      D3      D4      D5      D6      D7
    D1  1.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
    D2  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
    D3  1.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
    D4  1.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
    D5  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
    D6  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
    D7  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000

Cosine Similarity

Cosine Similarity is the more popular but also a slightly more complex measure of similarity. It is defined as:

  similarity(A,B) = cos θ = (A ⋅ B) / (|A| * |B|)
  where:
    A ⋅ B = Σ Ai * Bi
    |A| = sqrt(Σ Ai²)
    |B| = sqrt(Σ Bi²)
  for i = [0..n-1], where n = number of terms in our term-document matrix.

Here, θ is the angle between the vectors representing documents A and B in the n-dimensional term space. A and B are very similar as θ approaches 0° (cos 0° = 1), and very dissimilar as θ approaches 90° (cos 90° = 0).
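
For a concrete feel, here is a tiny hedged example in the two-term space mentioned earlier: document A uses only the first term, document B uses both terms equally, so the angle between their vectors is 45° and the cosine similarity works out to about 0.7071.

// Hypothetical two-term illustration of the cosine formula above
public class CosineByHand {
  public static void main(String[] args) {
    double[] a = {1.0, 0.0};   // document A: only term 1
    double[] b = {1.0, 1.0};   // document B: both terms, equally weighted
    double dotProduct = 0.0D, sumSqA = 0.0D, sumSqB = 0.0D;
    for (int i = 0; i < a.length; i++) {
      dotProduct += a[i] * b[i];
      sumSqA += a[i] * a[i];
      sumSqB += b[i] * b[i];
    }
    double cosine = dotProduct / (Math.sqrt(sumSqA) * Math.sqrt(sumSqB));
    System.out.println(cosine);  // 1 / sqrt(2) = 0.7071...
  }
}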

Dr. E. Garcia has a very informative tutorial on Cosine Similarity that explains the formula above in much more detail (and with more humor).

CosineSimilarity.java

The code below computes the cosine similarity based on the formula above. Here, normF() is the Frobenius norm, i.e. the square root of the sum of the squares of all the elements, which for a single-column matrix is just the vector magnitude |A|. The dot product is obtained by multiplying the two matrices element-wise with arrayTimes() and summing the result with norm1(), which again works only because we are dealing with single-column matrices of non-negative values.

// Source: src/main/java/com/mycompany/myapp/similarity/CosineSimilarity.java
package com.mycompany.myapp.similarity;

import Jama.Matrix;

public class CosineSimilarity extends AbstractSimilarity {

  @Override
  protected double computeSimilarity(Matrix sourceDoc, Matrix targetDoc) {
    // dot product: element-wise multiply, then sum (norm1 of a column matrix)
    double dotProduct = sourceDoc.arrayTimes(targetDoc).norm1();
    // product of the two vector magnitudes
    double euclideanDist = sourceDoc.normF() * targetDoc.normF();
    return dotProduct / euclideanDist;
  }
}

The results of this computation for a term document vector normalized with TF/IDF and LSI respectively are shown below. As before, the inter-document similarity is higher for the LSI normalized vector.

=== Cosine Similarity (TF/IDF) ===
            D1      D2      D3      D4      D5      D6      D7
    D1  1.0000  0.0000  0.0000  0.1316  0.0000  0.0000  0.0000
    D2  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.1680
    D3  0.0000  0.0000  1.0000  0.4198  0.0000  0.0000  0.0000
    D4  0.1316  0.0000  0.4198  1.0000  0.0000  0.0000  0.0000
    D5  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000
    D6  0.0000  0.0000  0.0000  0.0000  0.0000  1.0000  0.3154
    D7  0.0000  0.1680  0.0000  0.0000  0.0000  0.3154  1.0000

=== Cosine Similarity (LSI) ===
            D1      D2      D3      D4      D5      D6      D7
    D1  1.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
    D2  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
    D3  1.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
    D4  1.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
    D5  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
    D6  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
    D7  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000

Searching - Similarity against a Query Vector

Using the similarity measures above, it is now possible to build a Searcher. The objective is to find the documents that are most similar to our query. For this, we build a query vector consisting of the terms in the query, and then compute the similarity of each document against the query vector. The computation is the same as the ones above if we treat the query vector as a specialized instance of a document vector. We use Cosine Similarity for our computations here.
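
To make the query vector concrete, here is a hedged sketch (using a made-up five-term vocabulary instead of the full 23 terms) of the kind of column matrix the Searcher below builds internally. After dividing by norm1() (the column sum), each matched term gets a weight of 1/3:

// Hypothetical illustration of a query vector over a toy 5-term vocabulary
import Jama.Matrix;

public class QueryVectorSketch {
  public static void main(String[] args) {
    String[] terms = {"computer", "eps", "human", "interface", "system"};
    Matrix query = new Matrix(terms.length, 1, 0.0D);
    // query "human computer interface" -> 1.0 at the matching term rows
    query.set(0, 0, 1.0D);  // computer
    query.set(2, 0, 1.0D);  // human
    query.set(3, 0, 1.0D);  // interface
    // norm1() of a single-column matrix is its column sum, so this scales
    // the entries to sum to 1 (each matched term becomes 0.3333)
    query = query.times(1 / query.norm1());
    query.print(8, 4);
  }
}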

Searcher.java

// Source: src/main/java/com/mycompany/myapp/similarity/Searcher.java
package com.mycompany.myapp.similarity;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import Jama.Matrix;

public class Searcher {

  public class SearchResult {
    public String title;
    public double score;
    public SearchResult(String title, double score) {
      this.title = title;
      this.score = score;
    }
  };
  
  private Matrix termDocumentMatrix;
  private List<String> documents;
  private List<String> terms;
  private AbstractSimilarity similarity;
  
  public void setTermDocumentMatrix(Matrix termDocumentMatrix) {
    this.termDocumentMatrix = termDocumentMatrix;
  }
  
  public void setDocuments(String[] documents) {
    this.documents = Arrays.asList(documents);
  }
  
  public void setTerms(String[] terms) {
    this.terms = Arrays.asList(terms);
  }
  
  public void setSimilarity(AbstractSimilarity similarity) {
    this.similarity = similarity;
  }
  
  public List<SearchResult> search(String query) {
    // build up query matrix
    Matrix queryMatrix = getQueryMatrix(query);
    final Map<String,Double> similarityMap = 
      new HashMap<String,Double>();
    for (int i = 0; i < termDocumentMatrix.getColumnDimension(); i++) {
      double sim = similarity.computeSimilarity(queryMatrix, 
        termDocumentMatrix.getMatrix(
          0, termDocumentMatrix.getRowDimension() - 1, i, i));
      if (sim > 0.0D) {
        similarityMap.put(documents.get(i), sim);
      }
    }
    return sortByScore(similarityMap);
  }
  
  private Matrix getQueryMatrix(String query) {
    Matrix queryMatrix = new Matrix(terms.size(), 1, 0.0D);
    String[] queryTerms = query.split("\\s+");
    for (String queryTerm : queryTerms) {
      int termIdx = 0;
      for (String term : terms) {
        if (queryTerm.equalsIgnoreCase(term)) {
          queryMatrix.set(termIdx, 0, 1.0D);
        }
        termIdx++;
      }
    }
    queryMatrix = queryMatrix.times(1 / queryMatrix.norm1());
    return queryMatrix;
  }
  
  private List<SearchResult> sortByScore(
      final Map<String,Double> similarityMap) {
    List<SearchResult> results = new ArrayList<SearchResult>();
    List<String> docNames = new ArrayList<String>();
    docNames.addAll(similarityMap.keySet());
    Collections.sort(docNames, new Comparator<String>() {
      public int compare(String s1, String s2) {
        return similarityMap.get(s2).compareTo(similarityMap.get(s1));
      }
    });
    for (String docName : docNames) {
      double score = similarityMap.get(docName);
      if (score < 0.00001D) {
        continue;
      }
      results.add(new SearchResult(docName, score));
    }
    return results;
  }
}

The results of searching with the query "human computer interface" against term-document matrices normalized with TF/IDF and LSI respectively are shown below. Notice that we get a few more results from the LSI normalized vector. I have manually copied the titles over to make the results more readable. I suppose I could make the code do it, and I probably would have if the code were going to be used by anybody other than myself.

Results for query: [human computer interface] using TF/IDF
D1 (score =   0.8431) Human machine interface for computer applications
D4 (score =   0.1881) System and human system engineering testing of EPS

Results for query: [human computer interface] using LSI
D4 (score =   0.2467) System and human system engineering testing of EPS
D3 (score =   0.2467) The EPS user interface management system
D1 (score =   0.2467) Human machine interface for computer applications

Calling the code - the JUnit test

For completeness, as well as to show you how the components above should be called, I show here the JUnit test that was used to run these components and derive results for this post.

// Source: src/test/java/com/mycompany/myapp/similarity/SimilarityTest.java
package com.mycompany.myapp.similarity;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.junit.Before;
import org.junit.Test;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

import Jama.Matrix;

import com.mycompany.myapp.indexers.IdfIndexer;
import com.mycompany.myapp.indexers.LsiIndexer;
import com.mycompany.myapp.indexers.VectorGenerator;
import com.mycompany.myapp.similarity.Searcher.SearchResult;

public class SimilarityTest {

  private VectorGenerator vectorGenerator;
  
  @Before
  public void setUp() throws Exception {
    vectorGenerator = new VectorGenerator();
    vectorGenerator.setDataSource(new DriverManagerDataSource(
      "com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/tmdb", 
      "tmdb", "irstuff"));
    Map<String,Reader> documents = 
      new LinkedHashMap<String,Reader>();
    BufferedReader reader = new BufferedReader(
      new FileReader("src/test/resources/data/indexing_sample_data.txt"));
    String line = null;
    while ((line = reader.readLine()) != null) {
      String[] docTitleParts = StringUtils.split(line, ";");
      documents.put(docTitleParts[0], new StringReader(docTitleParts[1]));
    }
    vectorGenerator.generateVector(documents);
  }

  @Test
  public void testJaccardSimilarityWithTfIdfVector() throws Exception {
    IdfIndexer indexer = new IdfIndexer();
    Matrix termDocMatrix = indexer.transform(vectorGenerator.getMatrix());
    JaccardSimilarity jaccardSimilarity = new JaccardSimilarity();
    Matrix similarity = jaccardSimilarity.transform(termDocMatrix);
    prettyPrintMatrix("Jaccard Similarity (TF/IDF)", similarity, 
      vectorGenerator.getDocumentNames(), new PrintWriter(System.out, true));
  }
  
  @Test
  public void testJaccardSimilarityWithLsiVector() throws Exception {
    LsiIndexer indexer = new LsiIndexer();
    Matrix termDocMatrix = indexer.transform(vectorGenerator.getMatrix());
    JaccardSimilarity jaccardSimilarity = new JaccardSimilarity();
    Matrix similarity = jaccardSimilarity.transform(termDocMatrix);
    prettyPrintMatrix("Jaccard Similarity (LSI)", similarity, 
      vectorGenerator.getDocumentNames(), new PrintWriter(System.out, true));
  }

  @Test
  public void testCosineSimilarityWithTfIdfVector() throws Exception {
    IdfIndexer indexer = new IdfIndexer();
    Matrix termDocMatrix = indexer.transform(vectorGenerator.getMatrix());
    CosineSimilarity cosineSimilarity = new CosineSimilarity();
    Matrix similarity = cosineSimilarity.transform(termDocMatrix);
    prettyPrintMatrix("Cosine Similarity (TF/IDF)", similarity, 
      vectorGenerator.getDocumentNames(), new PrintWriter(System.out, true));
  }
  
  @Test
  public void testCosineSimilarityWithLsiVector() throws Exception {
    LsiIndexer indexer = new LsiIndexer();
    Matrix termDocMatrix = indexer.transform(vectorGenerator.getMatrix());
    CosineSimilarity cosineSimilarity = new CosineSimilarity();
    Matrix similarity = cosineSimilarity.transform(termDocMatrix);
    prettyPrintMatrix("Cosine Similarity (LSI)", similarity, 
      vectorGenerator.getDocumentNames(), new PrintWriter(System.out, true));
  }
  
  @Test
  public void testSearchWithTfIdfVector() throws Exception {
    // generate the term document matrix via the appropriate indexer
    IdfIndexer indexer = new IdfIndexer();
    Matrix termDocMatrix = indexer.transform(vectorGenerator.getMatrix());
    // set up the query
    Searcher searcher = new Searcher();
    searcher.setDocuments(vectorGenerator.getDocumentNames());
    searcher.setTerms(vectorGenerator.getWords());
    searcher.setSimilarity(new CosineSimilarity());
    searcher.setTermDocumentMatrix(termDocMatrix);
    // run the query
    List<SearchResult> results = 
      searcher.search("human computer interface");
    prettyPrintResults("human computer interface", results);
  }

  @Test
  public void testSearchWithLsiVector() throws Exception {
    // generate the term document matrix via the appropriate indexer
    LsiIndexer indexer = new LsiIndexer();
    Matrix termDocMatrix = indexer.transform(vectorGenerator.getMatrix());
    // set up the query
    Searcher searcher = new Searcher();
    searcher.setDocuments(vectorGenerator.getDocumentNames());
    searcher.setTerms(vectorGenerator.getWords());
    searcher.setSimilarity(new CosineSimilarity());
    searcher.setTermDocumentMatrix(termDocMatrix);
    // run the query
    List<SearchResult> results = 
      searcher.search("human computer interface");
    prettyPrintResults("human computer interface", results);
  }

  private void prettyPrintMatrix(String legend, Matrix matrix, 
      String[] documentNames, PrintWriter writer) {
    writer.printf("=== %s ===%n", legend);
    writer.printf("%6s", " ");
    for (int i = 0; i < documentNames.length; i++) {
      writer.printf("%8s", documentNames[i]);
    }
    writer.println();
    for (int i = 0; i < documentNames.length; i++) {
      writer.printf("%6s", documentNames[i]);
      for (int j = 0; j < documentNames.length; j++) {
        writer.printf("%8.4f", matrix.get(i, j));
      }
      writer.println();
    }
    writer.flush();
  }
  
  private void prettyPrintResults(String query, 
      List<SearchResult> results) {
    System.out.printf("Results for query: [%s]%n", query);
    for (SearchResult result : results) {
      System.out.printf("%s (score = %8.4f)%n", result.title, result.score);
    }
  }
}

Conclusion

I hope this post was helpful. I had looked at Term Vectors in Lucene in the past, but the idea of an n-dimensional term space never quite sank in until I started working through the material I describe above. I know that the data I am using is toy data, and that the computations described here are too expensive to be practical at real-life scale, but I believe that knowing this stuff helps in building models that are useful and practical.

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.

Saturday, September 20, 2008

IR Math with Java : TF, IDF and LSI

Recently, I started working on a fairly large text-mining project. During that time I have seen several common sense heuristics being designed and applied with very good results (some of them from me :-)), so I think a big part of being an IR (Information Retrieval) programmer is the ability to think quantitatively and be able to model problems in simple mathematical or statistical terms. Unless you are some kind of math genius (which I am not) or already have a background in applied math, it helps to know something about the models that are being used or proposed to solve various classes of problems, in order to have a starting point.

Text Mining Application Programming by Dr. Manu Konchady provides a lot of the math background I am looking for. The book targets programmers, not mathematicians or scientists, so it's easy to read (for me). It provides lucid explanations (with pseudo-code in the book and Perl code in the author's TextMine project) for basic algorithms used to solve some IR problems. The book doesn't cover advanced approaches, as one of my colleagues pointed out to me, but it provides a good base from which one can research more advanced techniques.

I've been working through this book, off and on, since I bought it. I learn better by doing, so I try to build the components that are described in each chapter. I have written about these efforts earlier here and here. In this post, I describe my code for generating various types of "indexes" (really term-document matrices based on a toy collection of documents) based on the algorithms discussed in Chapter 3 of the book.

The book describes three types of indexing approaches - term frequency (TF), inverse document frequency (IDF) and latent semantic indexing (LSI). To compute the frequency matrix, it takes a collection of 7 titles and creates a term document vector by tokenizing the titles. The list of 7 document titles is shown below:

D1      Human machine interface for computer applications
D2      A survey of user opinion of computer system response time
D3      The EPS user interface management system
D4      System and human system engineering testing of EPS
D5      The generation of random, binary and ordered trees
D6      The intersection graph of paths in trees
D7      Graph minors: A survey

Raw Frequency Extraction

To extract the frequencies, we must first extract content words and phrases from the text. I have described tokenizers and token recognizers in earlier posts. For this work, we create two additional recognizers, a stop word recognizer and a content word recognizer.

New recognizer: StopwordRecognizer.java

This recognizer has its own list of stop words if instantiated with the default (no-args) constructor. It can also be instantiated with an array of custom stopwords (for example, derived from a custom document collection using Zipf's Law) if desired. It only checks tokens of TokenType.WORD (so a word already classified as an abbreviation or phrase will not be touched), and if the token's value is in the stop set, it is marked as TokenType.STOP_WORD.

// Source: src/main/java/com/mycompany/myapp/recognizers/StopwordRecognizer.java
package com.mycompany.myapp.recognizers;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.commons.lang.StringUtils;

import com.mycompany.myapp.tokenizers.Token;
import com.mycompany.myapp.tokenizers.TokenType;

/**
 * A recognizer that recognizes common stop words. Special stopwords may
 * be passed in through the non-default constructor.
 */
public class StopwordRecognizer implements IRecognizer {

  // this list is taken from the TextMine project
  private static final String DEFAULT_STOPWORDS = 
    "a about add ago after all also an and another any are as at be " +
    "because been before being between big both but by came can come " +
    "could did do does due each else end far few for from get got had " +
    "has have he her here him himself his how if in into is it its " +
    "just let lie like low make many me might more most much must " +
    "my never no nor not now of off old on only or other our out over " +
    "per pre put re said same see she should since so some still such " +
    "take than that the their them then there these they this those " +
    "through to too under up use very via want was way we well were " +
    "what when where which while who will with would yes yet you your";

  private Set<String> stopwords = new HashSet<String>();
  
  public StopwordRecognizer() {
    super();
  }
  
  public StopwordRecognizer(String[] stopwords) {
    this.stopwords.addAll(Arrays.asList(stopwords));
  }
  
  public void init() throws Exception {
    if (stopwords.size() == 0) {
      String[] stopwordArray = StringUtils.split(DEFAULT_STOPWORDS, " ");
      stopwords.addAll(Arrays.asList(stopwordArray));
    }
  }

  public List<Token> recognize(List<Token> tokens) {
    List<Token> recognizedTokens = new ArrayList<Token>();
    for (Token token : tokens) {
      if (token.getType() == TokenType.WORD) {
        if (stopwords.contains(StringUtils.lowerCase(token.getValue()))) {
          token.setType(TokenType.STOP_WORD);
        }
      }
      recognizedTokens.add(token);
    }
    return recognizedTokens;
  }
}
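
For illustration, here is a minimal hedged usage sketch of this recognizer on its own (it assumes the Token and TokenType classes from the earlier tokenizer posts, including a Token(String, TokenType) constructor):

// Hypothetical usage sketch; Token and TokenType come from earlier posts
import java.util.Arrays;
import java.util.List;

import com.mycompany.myapp.tokenizers.Token;
import com.mycompany.myapp.tokenizers.TokenType;

public class StopwordRecognizerUsage {
  public static void main(String[] args) throws Exception {
    StopwordRecognizer recognizer = new StopwordRecognizer();
    recognizer.init();
    List<Token> tokens = Arrays.asList(
      new Token("The", TokenType.WORD),
      new Token("EPS", TokenType.ABBREVIATION),
      new Token("user", TokenType.WORD));
    // "The" is re-marked as TokenType.STOP_WORD, "user" stays a WORD,
    // and "EPS" is left alone because it is not TokenType.WORD
    for (Token token : recognizer.recognize(tokens)) {
      System.out.println(token.getValue() + " => " + token.getType());
    }
  }
}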

New recognizer: ContentWordRecognizer.java

This recognizer picks out nouns, verbs, adjectives and adverbs and marks them as TokenType.CONTENT_WORD. As with the previous recognizer, only tokens of TokenType.WORD are acted on. The part-of-speech lookup is done against the WordNet dictionary, accessed through the MIT Java WordNet Interface (JWI).

// Source: src/main/java/com/mycompany/myapp/recognizers/ContentWordRecognizer.java
package com.mycompany.myapp.recognizers;

import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import com.mycompany.myapp.tokenizers.Token;
import com.mycompany.myapp.tokenizers.TokenType;

import edu.mit.jwi.Dictionary;
import edu.mit.jwi.IDictionary;
import edu.mit.jwi.item.IIndexWord;
import edu.mit.jwi.item.POS;

/**
 * Recognizes content words (noun, verb, adjective, and adverb) from a
 * List of Token objects. Only TokenType.WORD tokens are considered in
 * this recognizer, and are converted to TokenType.CONTENT_WORD. Words
 * are looked up against the WordNet dictionary.
 */
public class ContentWordRecognizer implements IRecognizer {

  private IDictionary dictionary;
  private List<POS> allowablePartsOfSpeech = Arrays.asList(new POS[] {
    POS.NOUN, POS.VERB, POS.ADJECTIVE, POS.ADVERB});
  
  public void init() throws Exception {
    this.dictionary = new Dictionary(
      new URL("file", null, "/opt/wordnet-3.0/dict"));
    dictionary.open();
  }

  public List<Token> recognize(List<Token> tokens) {
    List<Token> outputTokens = new ArrayList<Token>();
    for (Token token : tokens) {
      Token outputToken = new Token(token.getValue(), token.getType());
      if (token.getType() == TokenType.WORD) {
        String word = token.getValue();
        for (POS allowablePartOfSpeech : allowablePartsOfSpeech) {
          IIndexWord indexWord = 
            dictionary.getIndexWord(word, allowablePartOfSpeech);
          if (indexWord != null) {
            outputToken.setType(TokenType.CONTENT_WORD);
            break;
          }
        }
      }
      outputTokens.add(outputToken);
    }
    return outputTokens;
  }
}

Generating the initial vector: VectorGenerator.java

The initial vector is created from a Map of document names to Readers pointing at the titles. This may seem overly complex for our particular situation, where we could have made do with a Map<String,String>, but I was going for a more general solution with Map<String,Reader>, where the Reader reads files of content. The code for the VectorGenerator is shown below. It's fairly simple: it tokenizes the titles into words, then passes each word through a chain of recognizers. At the end, it extracts only the content words and builds the raw term-document matrix (shown after the JUnit test below).

// Source: src/main/java/com/mycompany/myapp/indexers/VectorGenerator.java
package com.mycompany.myapp.indexers;

import java.io.PrintWriter;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

import javax.sql.DataSource;

import org.apache.commons.collections15.Bag;
import org.apache.commons.collections15.bag.HashBag;
import org.apache.commons.lang.ArrayUtils;
import org.apache.commons.lang.StringUtils;
import org.springframework.beans.factory.annotation.Required;

import Jama.Matrix;

import com.mycompany.myapp.recognizers.AbbreviationRecognizer;
import com.mycompany.myapp.recognizers.BoundaryRecognizer;
import com.mycompany.myapp.recognizers.ContentWordRecognizer;
import com.mycompany.myapp.recognizers.IRecognizer;
import com.mycompany.myapp.recognizers.PhraseRecognizer;
import com.mycompany.myapp.recognizers.RecognizerChain;
import com.mycompany.myapp.recognizers.StopwordRecognizer;
import com.mycompany.myapp.tokenizers.Token;
import com.mycompany.myapp.tokenizers.TokenType;
import com.mycompany.myapp.tokenizers.WordTokenizer;

/**
 * Generates the word occurrence matrix for a document collection.
 */
public class VectorGenerator {

  private DataSource dataSource;
  
  private Map<Integer,String> wordIdValueMap = 
    new HashMap<Integer,String>();
  private Map<Integer,String> documentIdNameMap = 
    new HashMap<Integer,String>();
  private Matrix matrix;

  @Required
  public void setDataSource(DataSource dataSource) {
    this.dataSource = dataSource;
  }

  public void generateVector(Map<String,Reader> documents) 
      throws Exception {
    Map<String,Bag<String>> documentWordFrequencyMap = 
      new HashMap<String,Bag<String>>();
    SortedSet<String> wordSet = new TreeSet<String>();
    Integer docId = 0;
    for (String key : documents.keySet()) {
      String text = getText(documents.get(key));
      Bag<String> wordFrequencies = getWordFrequencies(text);
      wordSet.addAll(wordFrequencies.uniqueSet());
      documentWordFrequencyMap.put(key, wordFrequencies);
      documentIdNameMap.put(docId, key);
      docId++;
    }
    // create a Map of ids to words from the wordSet
    int wordId = 0;
    for (String word : wordSet) {
      wordIdValueMap.put(wordId, word);
      wordId++;
    }
    // we need a documents.keySet().size() x wordSet.size() matrix to hold
    // this info
    int numDocs = documents.keySet().size();
    int numWords = wordSet.size();
    double[][] data = new double[numWords][numDocs];
    for (int i = 0; i < numWords; i++) {
      for (int j = 0; j < numDocs; j++) {
        String docName = documentIdNameMap.get(j);
        Bag<String> wordFrequencies = 
          documentWordFrequencyMap.get(docName);
        String word = wordIdValueMap.get(i);
        int count = wordFrequencies.getCount(word);
        data[i][j] = count;
      }
    }
    matrix = new Matrix(data);
  }

  public Matrix getMatrix() {
    return matrix;
  }
  
  public String[] getDocumentNames() {
    String[] documentNames = new String[documentIdNameMap.keySet().size()];
    for (int i = 0; i < documentNames.length; i++) {
      documentNames[i] = documentIdNameMap.get(i);
    }
    return documentNames;
  }
  
  public String[] getWords() {
    String[] words = new String[wordIdValueMap.keySet().size()];
    for (int i = 0; i < words.length; i++) {
      String word = wordIdValueMap.get(i);
      if (word.contains("|||")) {
        // phrases are stored with length for other purposes, strip it off
        // for this report.
        word = word.substring(0, word.indexOf("|||"));
      }
      words[i] = word;
    }
    return words;
  }

  private Bag<String> getWordFrequencies(String text) 
      throws Exception {
    Bag<String> wordBag = new HashBag<String>();
    WordTokenizer wordTokenizer = new WordTokenizer();
    wordTokenizer.setText(text);
    List<Token> tokens = new ArrayList<Token>();
    Token token = null;
    while ((token = wordTokenizer.nextToken()) != null) {
      tokens.add(token);
    }
    RecognizerChain recognizerChain = new RecognizerChain(
        Arrays.asList(new IRecognizer[] {
        new BoundaryRecognizer(),
        new AbbreviationRecognizer(dataSource),
        new PhraseRecognizer(dataSource),
        new StopwordRecognizer(),
        new ContentWordRecognizer()
    }));
    recognizerChain.init();
    List<Token> recognizedTokens = recognizerChain.recognize(tokens);
    for (Token recognizedToken : recognizedTokens) {
      if (recognizedToken.getType() == TokenType.ABBREVIATION ||
          recognizedToken.getType() == TokenType.PHRASE ||
          recognizedToken.getType() == TokenType.CONTENT_WORD) {
        // lowercase words to treat Human and human as the same word
        wordBag.add(StringUtils.lowerCase(recognizedToken.getValue()));
      }
    }
    return wordBag;
  }

  private String getText(Reader reader) throws Exception {
    StringBuilder textBuilder = new StringBuilder();
    char[] cbuf = new char[1024];
    int len = 0;
    while ((len = reader.read(cbuf, 0, 1024)) != -1) {
      textBuilder.append(ArrayUtils.subarray(cbuf, 0, len));
    }
    reader.close();
    return textBuilder.toString();
  }
}

The test case (to run each example) consists of a single JUnit test (see below), which simply instantiates and runs each "indexer" implementation.

// Source: src/test/java/com/mycompany/myapp/indexers/IndexersTest.java
package com.mycompany.myapp.indexers;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.junit.Before;
import org.junit.Test;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

import Jama.Matrix;

public class IndexersTest {

  private VectorGenerator vectorGenerator;
  private Map<String,Reader> documents;
  
  @Before
  public void setUp() throws Exception {
    vectorGenerator = new VectorGenerator();
    vectorGenerator.setDataSource(new DriverManagerDataSource(
      "com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/tmdb", 
      "tmdb", "irstuff"));
    documents = new LinkedHashMap<String,Reader>();
    BufferedReader reader = new BufferedReader(
      new FileReader("src/test/resources/data/indexing_sample_data.txt"));
    String line = null;
    while ((line = reader.readLine()) != null) {
      String[] docTitleParts = StringUtils.split(line, ";");
      documents.put(docTitleParts[0], new StringReader(docTitleParts[1]));
    }
  }
  
  @Test
  public void testVectorGeneration() throws Exception {
    vectorGenerator.generateVector(documents);
    prettyPrintMatrix("Raw Term Frequencies", vectorGenerator.getMatrix(), 
      vectorGenerator.getDocumentNames(), vectorGenerator.getWords(), 
      new PrintWriter(System.out, true));
  }
  
  @Test
  public void testTfIndexer() throws Exception {
    vectorGenerator.generateVector(documents);
    TfIndexer indexer = new TfIndexer();

    Matrix tfMatrix = indexer.transform(vectorGenerator.getMatrix());
    prettyPrintMatrix("Term Frequency", tfMatrix, 
      vectorGenerator.getDocumentNames(), vectorGenerator.getWords(), 
      new PrintWriter(System.out, true));
  }
  
  @Test
  public void testIdfIndexer() throws Exception {
    vectorGenerator.generateVector(documents);
    IdfIndexer indexer = new IdfIndexer();
    Matrix idfMatrix = indexer.transform(vectorGenerator.getMatrix());
    prettyPrintMatrix("Inverse Document Frequency", idfMatrix,
      vectorGenerator.getDocumentNames(), vectorGenerator.getWords(),
      new PrintWriter(System.out, true));
  }
  
  @Test
  public void testLsiIndexer() throws Exception {
    vectorGenerator.generateVector(documents);
    LsiIndexer indexer = new LsiIndexer();
    Matrix lsiMatrix = indexer.transform(vectorGenerator.getMatrix());
    prettyPrintMatrix("Latent Semantic (LSI)", lsiMatrix,
      vectorGenerator.getDocumentNames(), vectorGenerator.getWords(),
      new PrintWriter(System.out, true));
  }
  
  private void prettyPrintMatrix(String legend, Matrix matrix, 
      String[] documentNames, String[] words, PrintWriter writer) {
    writer.printf("=== %s ===%n", legend);
    writer.printf("%15s", " ");
    for (int i = 0; i < documentNames.length; i++) {
      writer.printf("%8s", documentNames[i]);
    }
    writer.println();
    for (int i = 0; i < words.length; i++) {
      writer.printf("%15s", words[i]);
      for (int j = 0; j < documentNames.length; j++) {
        writer.printf("%8.4f", matrix.get(i, j));
      }
      writer.println();
    }
    writer.flush();
  }
}

The first test in the JUnit test outputs our initial raw matrix. Later tests will create it again (see @Before) and operate on it in different ways. Here is what the raw matrix will look like:

=== Raw Term Frequencies ===
                     D1      D2      D3      D4      D5      D6      D7
         binary  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000
       computer  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
computer system  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000
    engineering  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000  0.0000
            eps  0.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
     generation  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000
          graph  0.0000  0.0000  0.0000  0.0000  0.0000  1.0000  1.0000
          human  1.0000  0.0000  0.0000  1.0000  0.0000  0.0000  0.0000
      interface  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
   intersection  0.0000  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000
        machine  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
     management  0.0000  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000
         minors  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  1.0000
        opinion  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000
        ordered  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000
         random  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000
       response  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000
         survey  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  1.0000
         system  0.0000  0.0000  1.0000  2.0000  0.0000  0.0000  0.0000
        testing  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000  0.0000
           time  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000
           user  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000
 user interface  0.0000  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000

Term Frequency Indexing

The term frequency indexing method is the most simplistic: all it does is normalize the raw frequencies within a single document. Thus, if a document had two words, one occurring twice and the other occurring three times, the first word would be normalized to 2/5 (0.4) and the other to 3/5 (0.6). We use Jama, a Java linear algebra library, because of its ability to do SVD (more on that later). Here is the code:

// Source: src/main/java/com/mycompany/myapp/indexers/TfIndexer.java
package com.mycompany.myapp.indexers;

import org.apache.commons.collections15.Transformer;

import Jama.Matrix;

/**
 * Normalizes the occurrence matrix per document. Divides each entry by the
 * sum of occurrence values for that column. That way a longer document which
 * has 2 occurrences of a certain word will not be ranked higher than a
 * shorter document with 1 occurrence of the word for that word. At the 
 * end of this transformation, the values are the frequency of the word 
 * in the document.
 */
public class TfIndexer implements Transformer<Matrix,Matrix> {

  public Matrix transform(Matrix matrix) {
    for (int j = 0; j < matrix.getColumnDimension(); j++) {
      double sum = sum(matrix.getMatrix(
        0, matrix.getRowDimension() -1, j, j));
      for (int i = 0; i < matrix.getRowDimension(); i++) {
        matrix.set(i, j, (matrix.get(i, j) / sum));
      }
    }
    return matrix;
  }

  private double sum(Matrix colMatrix) {
    double sum = 0.0D;
    for (int i = 0; i < colMatrix.getRowDimension(); i++) {
      sum += colMatrix.get(i, 0);
    }
    return sum;
  }
}

The results of this computation are shown below. Notice that all the columns now add up to 1, meaning that all documents are treated the same regardless of their length (and consequently their number of content words).

=== Term Frequency ===
                     D1      D2      D3      D4      D5      D6      D7
         binary  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
       computer  0.2500  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
computer system  0.0000  0.1667  0.0000  0.0000  0.0000  0.0000  0.0000
    engineering  0.0000  0.0000  0.0000  0.1667  0.0000  0.0000  0.0000
            eps  0.0000  0.0000  0.2500  0.1667  0.0000  0.0000  0.0000
     generation  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
          graph  0.0000  0.0000  0.0000  0.0000  0.0000  0.5000  0.3333
          human  0.2500  0.0000  0.0000  0.1667  0.0000  0.0000  0.0000
      interface  0.2500  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
   intersection  0.0000  0.0000  0.0000  0.0000  0.0000  0.5000  0.0000
        machine  0.2500  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
     management  0.0000  0.0000  0.2500  0.0000  0.0000  0.0000  0.0000
         minors  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.3333
        opinion  0.0000  0.1667  0.0000  0.0000  0.0000  0.0000  0.0000
        ordered  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
         random  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
       response  0.0000  0.1667  0.0000  0.0000  0.0000  0.0000  0.0000
         survey  0.0000  0.1667  0.0000  0.0000  0.0000  0.0000  0.3333
         system  0.0000  0.0000  0.2500  0.3333  0.0000  0.0000  0.0000
        testing  0.0000  0.0000  0.0000  0.1667  0.0000  0.0000  0.0000
           time  0.0000  0.1667  0.0000  0.0000  0.0000  0.0000  0.0000
           user  0.0000  0.1667  0.0000  0.0000  0.0000  0.0000  0.0000
 user interface  0.0000  0.0000  0.2500  0.0000  0.0000  0.0000  0.0000

Inverse Document Frequency Indexing

Inverse Document Frequency attempts to dampen the weight of words that appear in many documents. If a word occurs in more than one document it is a less "precise" discriminator, and hence its value should go down. The code below takes the raw matrix and applies the IDF weight in the form of a logarithmic smoothing operator, then normalizes the results per document. So this is really a combination of TF and IDF. The smoothing operator is:

weight(i,j) = term_frequency(i,j) * (1 + log(N) - log(d(i)))
  where:
    weight(i,j) = value of the IDF matrix for word i in document j.
    term_frequency(i,j) = raw frequency of the word at position (i,j).
    N = number of documents in the collection.
    d(i) = number of documents containing word i.
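
As a quick check of this formula against the output further down, here is a hedged back-of-the-envelope sketch for column D1 of the raw matrix: three of its terms (computer, interface, machine) occur in only one document, while human occurs in two (D1 and D4), and after the per-document normalization the weights come out to the 0.2656 and 0.2031 values you see in the table:

// Hypothetical hand calculation for the D1 column (N = 7 documents)
public class IdfByHand {
  public static void main(String[] args) {
    int numDocs = 7;
    // raw frequency * (1 + log(N) - log(d)) for each content word in D1
    double computer = 1 * (1 + Math.log(numDocs) - Math.log(1)); // d = 1
    double human    = 1 * (1 + Math.log(numDocs) - Math.log(2)); // d = 2
    double interfac = 1 * (1 + Math.log(numDocs) - Math.log(1)); // d = 1
    double machine  = 1 * (1 + Math.log(numDocs) - Math.log(1)); // d = 1
    double sum = computer + human + interfac + machine;
    System.out.printf("computer = %6.4f%n", computer / sum); // 0.2656
    System.out.printf("human    = %6.4f%n", human / sum);    // 0.2031
  }
}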

The indexer code is shown below:

// Source: src/main/java/com/mycompany/myapp/indexers/IdfIndexer.java
package com.mycompany.myapp.indexers;

import org.apache.commons.collections15.Transformer;

import Jama.Matrix;

/**
 * Reduces the weight of words which are commonly found (ie in more
 * documents). The factor by which it is reduced is chosen from the book
 * as:
 *   f(m) = 1 + log(N/d(m))
 * where N = total number of docs in collection
 *       d(m) = number of docs containing word m
 * so where a word is more frequent (ie d(m) is high), f(m) would be low.
 */
public class IdfIndexer implements Transformer<Matrix,Matrix> {

  public Matrix transform(Matrix matrix) {
    // Phase 1: apply IDF weight to the raw word frequencies
    int n = matrix.getColumnDimension();
    for (int j = 0; j < matrix.getColumnDimension(); j++) {
      for (int i = 0; i < matrix.getRowDimension(); i++) {
        double matrixElement = matrix.get(i, j);
        if (matrixElement > 0.0D) {
          double dm = countDocsWithWord(matrix.getMatrix(
            i, i, 0, matrix.getColumnDimension() - 1));
          matrix.set(i, j, 
            matrix.get(i, j) * (1 + Math.log(n) - Math.log(dm)));
        }
      }
    }
    // Phase 2: normalize the word scores for a single document
    for (int j = 0; j < matrix.getColumnDimension(); j++) {
      double sum = sum(matrix.getMatrix(
        0, matrix.getRowDimension() - 1, j, j));
      for (int i = 0; i < matrix.getRowDimension(); i++) {
        matrix.set(i, j, (matrix.get(i, j) / sum));
      }
    }
    return matrix;
  }

  private double sum(Matrix colMatrix) {
    double sum = 0.0D;
    for (int i = 0; i < colMatrix.getRowDimension(); i++) {
      sum += colMatrix.get(i, 0);
    }
    return sum;
  }

  private double countDocsWithWord(Matrix rowMatrix) {
    double numDocs = 0.0D;
    for (int j = 0; j < rowMatrix.getColumnDimension(); j++) {
      if (rowMatrix.get(0, j) > 0.0D) {
        numDocs++;
      }
    }
    return numDocs;
  }
}

The resulting matrix after the IDF weighting and normalization are applied is shown below. Notice that the scores for words (such as human) which occur in more than one document have decreased.

=== Inverse Document Frequency ===
                     D1      D2      D3      D4      D5      D6      D7
         binary  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
       computer  0.2656  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
computer system  0.0000  0.1735  0.0000  0.0000  0.0000  0.0000  0.0000
    engineering  0.0000  0.0000  0.0000  0.1977  0.0000  0.0000  0.0000
            eps  0.0000  0.0000  0.2167  0.1512  0.0000  0.0000  0.0000
     generation  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
          graph  0.0000  0.0000  0.0000  0.0000  0.0000  0.4333  0.3023
          human  0.2031  0.0000  0.0000  0.1512  0.0000  0.0000  0.0000
      interface  0.2656  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
   intersection  0.0000  0.0000  0.0000  0.0000  0.0000  0.5667  0.0000
        machine  0.2656  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
     management  0.0000  0.0000  0.2833  0.0000  0.0000  0.0000  0.0000
         minors  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.3953
        opinion  0.0000  0.1735  0.0000  0.0000  0.0000  0.0000  0.0000
        ordered  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
         random  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
       response  0.0000  0.1735  0.0000  0.0000  0.0000  0.0000  0.0000
         survey  0.0000  0.1327  0.0000  0.0000  0.0000  0.0000  0.3023
         system  0.0000  0.0000  0.2167  0.3023  0.0000  0.0000  0.0000
        testing  0.0000  0.0000  0.0000  0.1977  0.0000  0.0000  0.0000
           time  0.0000  0.1735  0.0000  0.0000  0.0000  0.0000  0.0000
           user  0.0000  0.1735  0.0000  0.0000  0.0000  0.0000  0.0000
 user interface  0.0000  0.0000  0.2833  0.0000  0.0000  0.0000  0.0000

Latent Semantic Indexing (LSI)

Latent Semantic Indexing attempts to uncover latent relationships among documents based on word co-occurrence. So if document A contains (w1,w2) and document B contains (w2,w3), we can conclude that there is something common between documents A and B. LSI does this by decomposing the input raw term frequency matrix (A, see below) into three different matrices (U, S and V) using Singular Value Decomposition (SVD). Once that is done, the three matrices are "reduced" by truncation and the original matrix is rebuilt from the reduced pieces. Because of the reduction, noisy relationships are suppressed and the latent relationships become clearly visible. In pseudo-code:

A = U * S * V^T
  A_k = U_k * S_k * V_k^T
  where:
    A = the original matrix
    U = the word vector
    S = the sigma vector
    V = the document vector
    U_k = the reduced word submatrix consisting of cols 0..k-1
    S_k = the reduced sigma submatrix consisting of cols 0..k-1, rows 0..k-1
    V_k = the reduced document submatrix consisting of cols 0..k-1.
  Note:
    Jama will give you back V (not V^T), so you need to reduce and transpose it
    before you compute A_k.
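
To see the decomposition step in isolation, here is a small hedged sketch on a toy 3x2 matrix using Jama; the LsiIndexer below does the same thing to the full 23x7 matrix and then truncates U, S and V to k columns before multiplying them back.

// Hypothetical toy illustration of the A = U * S * V^T decomposition
import Jama.Matrix;
import Jama.SingularValueDecomposition;

public class SvdSketch {
  public static void main(String[] args) {
    double[][] data = {{1, 0}, {1, 1}, {0, 1}};
    Matrix a = new Matrix(data);
    SingularValueDecomposition svd = new SingularValueDecomposition(a);
    Matrix u = svd.getU();  // 3x2, the word (left singular) vectors
    Matrix s = svd.getS();  // 2x2, diagonal matrix of singular values
    Matrix v = svd.getV();  // 2x2, the document (right singular) vectors
    // multiplying the parts back together reconstructs A (within rounding)
    u.times(s).times(v.transpose()).print(8, 4);
  }
}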

Dr E Garcia used to have a really good tutorial on LSI/SVD which is sadly no longer available. However, the IR Book has a chapter dedicated to this. Thanks to ndk for suggesting this link.

As mentioned before, Jama was chosen because it was the only free Java linear algebra library I knew of that could do SVD. The code for the LSI Indexer is here:

// Source: src/main/java/com/mycompany/myapp/indexers/LsiIndexer.java
package com.mycompany.myapp.indexers;

import org.apache.commons.collections15.Transformer;

import Jama.Matrix;
import Jama.SingularValueDecomposition;

/**
 * Uses Latent Semantic Indexing to find word associations between docs. 
 * Idea is to find the intersections of the words found in each document
 * and score them accordingly. We use SVD to accomplish this. We first
 * decompose the word frequency vector into the three parts, then multiply
 * the three components back to get our transformed matrix.
 */
public class LsiIndexer implements Transformer<Matrix,Matrix> {

  public Matrix transform(Matrix matrix) {
    // phase 1: Singular value decomposition
    SingularValueDecomposition svd = new SingularValueDecomposition(matrix);
    Matrix wordVector = svd.getU();
    Matrix sigma = svd.getS();
    Matrix documentVector = svd.getV();
    // compute the value of k (ie where to truncate)
    int k = (int) Math.floor(Math.sqrt(matrix.getColumnDimension()));
    Matrix reducedWordVector = wordVector.getMatrix(
      0, wordVector.getRowDimension() - 1, 0, k - 1);
    Matrix reducedSigma = sigma.getMatrix(0, k - 1, 0, k - 1);
    Matrix reducedDocumentVector = documentVector.getMatrix(
      0, documentVector.getRowDimension() - 1, 0, k - 1);
    Matrix weights = reducedWordVector.times(
      reducedSigma).times(reducedDocumentVector.transpose());
    // Phase 2: normalize the word scores for a single document
    for (int j = 0; j < weights.getColumnDimension(); j++) {
      double sum = sum(weights.getMatrix(
        0, weights.getRowDimension() - 1, j, j));
      for (int i = 0; i < weights.getRowDimension(); i++) {
        weights.set(i, j, Math.abs((weights.get(i, j)) / sum));
      }
    }
    return weights;
  }

  private double sum(Matrix colMatrix) {
    double sum = 0.0D;
    for (int i = 0; i < colMatrix.getRowDimension(); i++) {
      sum += colMatrix.get(i, 0);
    }
    return sum;
  }
}

And here is the output from the indexer. First the raw frequencies go through the singular value decomposition, reduction and recomposition process, then they are normalized for each document. Notice that there are more non-zero elements representing latent "relationship" weights.

=== Latent Semantic (LSI) ===
                     D1      D2      D3      D4      D5      D6      D7
         binary  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
       computer  0.0198  0.0000  0.0198  0.0198  0.0000  0.0000  0.0000
computer system  0.0000  0.1405  0.0000  0.0000  0.1405  0.1405  0.1405
    engineering  0.1138  0.0000  0.1138  0.1138  0.0000  0.0000  0.0000
            eps  0.1733  0.0000  0.1733  0.1733  0.0000  0.0000  0.0000
     generation  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
          graph  0.0000  0.0559  0.0000  0.0000  0.0559  0.0559  0.0559
          human  0.1336  0.0000  0.1336  0.1336  0.0000  0.0000  0.0000
      interface  0.0198  0.0000  0.0198  0.0198  0.0000  0.0000  0.0000
   intersection  0.0000  0.0105  0.0000  0.0000  0.0105  0.0105  0.0105
        machine  0.0198  0.0000  0.0198  0.0198  0.0000  0.0000  0.0000
     management  0.0595  0.0000  0.0595  0.0595  0.0000  0.0000  0.0000
         minors  0.0000  0.0454  0.0000  0.0000  0.0454  0.0454  0.0454
        opinion  0.0000  0.1405  0.0000  0.0000  0.1405  0.1405  0.1405
        ordered  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
         random  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
       response  0.0000  0.1405  0.0000  0.0000  0.1405  0.1405  0.1405
         survey  0.0000  0.1859  0.0000  0.0000  0.1859  0.1859  0.1859
         system  0.2871  0.0000  0.2871  0.2871  0.0000  0.0000  0.0000
        testing  0.1138  0.0000  0.1138  0.1138  0.0000  0.0000  0.0000
           time  0.0000  0.1405  0.0000  0.0000  0.1405  0.1405  0.1405
           user  0.0000  0.1405  0.0000  0.0000  0.1405  0.1405  0.1405
 user interface  0.0595  0.0000  0.0595  0.0595  0.0000  0.0000  0.0000

Conclusion

The first two indexing methods are probably familiar to a lot of people, and it is very likely that the ones in use (indirectly, through common IR libraries such as Lucene) in most shops are quite a bit more advanced than the ones shown. LSI using SVD was a new approach to me, but it became intuitively obvious once I understood the process. Hopefully, this article was able to share some of my new-found insight with you. The Java code for each of these processes illustrates how easy it is to actually do these transforms, especially using libraries that do most of the heavy lifting.

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.