Saturday, November 08, 2008

IR Math in Java : HMM Based POS Tagger/Recognizer

As you know, I have been slowly working my way through Dr Konchady's TMAP book, and coding up the algorithms in Java. By doing so, I hope to understand the techniques and mathematical models presented, so I can apply them to real-life problems in the future. In this post I describe an implementation of a Hidden Markov Model based Part of Speech recognizer/tagger, based on the material presented in Chapter 4 of the TMAP book.

Background

What follows is my take on what an HMM is and how it can be used for Part of Speech (POS) tagging. For a more detailed, math-heavy, and possibly more accurate description of HMMs and their internals, read the Wikipedia article or Dr Rabiner's tutorial, or the TMAP book if you happen to own it. A Hidden Markov Model can be thought of as a probabilistic finite state machine. Its states, represented by the set S = {S1, S2, ..., Sn}, are not directly visible. What is visible is a set of Observations O = {O1, O2, ..., Om}, which are the result of the machine moving from one state to another. The probabilities of the machine starting in one of the states Si are specified by the one-dimensional matrix Π of size n. The probabilities of the machine moving from one state to another are specified by a two-dimensional matrix A of size n*n. Finally, the probability of an observation being observed when the machine is in a certain state is given by the two-dimensional matrix B of size n*m. More succinctly,

  H = f(Π, A, B)
  where:
    H = the HMM
    Π = start probabilities. The element Πi represents 
        the probability that the HMM starts a sequence in
        State Si, where i in (0..n-1).
    A = transition probabilities. The element Ai,j represents
        the probability of a transition from State Si to
        State Sj, where i and j in (0..n-1).
    B = emission probabilities. The element Bi,j represents
        the probability of an Observation Oj occurring while
        the machine is in State Si, where i in (0..n-1)
        and j in (0..m-1).
    n = number of states.
    m = number of unique observations.
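
To make the notation concrete, here is a tiny, hypothetical two-state model written out as plain Java arrays (my own toy example, not from the TMAP book). Say the states are {NOUN, VERB} (n=2) and the observations are word ids {0, 1, 2} (m=3):

  // Π: P(start in NOUN) = 0.6, P(start in VERB) = 0.4
  double[] pi = {0.6, 0.4};
  // A: row i holds the transition probabilities out of state i,
  // e.g. P(NOUN -> VERB) = 0.7; each row sums to 1
  double[][] a = {{0.3, 0.7},
                  {0.8, 0.2}};
  // B: row i holds the emission probabilities for state i,
  // e.g. P(word 2 | VERB) = 0.7; each row sums to 1
  double[][] b = {{0.5, 0.4, 0.1},
                  {0.1, 0.2, 0.7}};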

The objective of POS tagging is to tag each word of a sentence with its part-of-speech tag. While some words can be unambiguously tagged, i.e., there is only one POS that the word is ever used for, there are quite a few that can represent different POS depending on their usage in a sentence. For example, cold can be both a noun and an adjective, and catch can be both a noun and a verb. The fact that the word exists in the sentence is known, while the POS for the word is unknown. Therefore the HMM built for POS tagging models the words as visible observations and the set of possible POS as the hidden states.

As far as POS tagging is concerned, the first kind of problem that can be solved by an HMM comes in two closely related flavors. Given an HMM,

  1. Finding the most likely state sequence for a given observation sequence. In this case, we pass in a sentence, and tag each word with its most likely POS.
  2. Finding the most likely state for a given observation in a sequence. This is useful for word sense disambiguation (WSD), so we can tell the most likely POS that a particular word in a sentence belongs to.

The two problems above are identical from the point of view of the HMM; both are solved using the Viterbi algorithm.

The second kind of problem is to find the probability of a certain sequence of observations. This can be used to answer questions such as whether a sentence like "I am going to the market" is more likely than "To the market I go". The Forward-Backward algorithm is used to solve this kind of problem. It can be useful for applications that predict the most likely outcome given a set of input observations, but is probably not important from the perspective of POS tagging. Both of the above problems need an HMM built from (manually) tagged data.

A third kind of problem is how to build an HMM given a corpus of untagged text. Such an HMM would allow us to solve the second kind of problem. However, since the HMM has not been fed tagged words, it must depend on a clustering algorithm such as K-Means to cluster the words into unlabeled hidden states, which are of no use when attempting to solve the first kind of problem. The two learning algorithms used here are the K-Means algorithm to build the initial HMM and the Baum-Welch algorithm to refine it. As with the second kind of problem, this does not have much application where POS taggers are concerned.

I used the Java HMM library Jahmm to do all of the heavy computational lifting. It has implementations of the algorithms mentioned above, as well as several utility methods and classes to model various kinds of Observation. The sketch below shows how each problem type maps onto a Jahmm class; the full details appear in the HmmTagger code later in this post.
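
The following sketch (mine, not from the Jahmm documentation) pulls the relevant entry points together in one place. It assumes hmm is a trained Hmm<ObservationInteger>, obs is a sentence converted to a List<ObservationInteger>, and nbStates, factory and sequences are set up as in the HmmTagger code later in this post.

  // Problem 1 (decoding): most likely POS sequence for a sentence (Viterbi)
  ViterbiCalculator vc = new ViterbiCalculator(obs, hmm);
  int[] stateIds = vc.stateSequence();  // one hidden state id per word

  // Problem 2 (evaluation): probability of observing this word sequence
  ForwardBackwardCalculator fbc = new ForwardBackwardCalculator(obs, hmm);
  double probability = fbc.probability();

  // Problem 3 (learning): induce an HMM from untagged sequences, then refine
  KMeansLearner<ObservationInteger> kml =
    new KMeansLearner<ObservationInteger>(nbStates, factory, sequences);
  Hmm<ObservationInteger> learned = kml.iterate();
  learned = new BaumWelchLearner().iterate(learned, sequences);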

Building the HMM from Tagged Data

For my tagged corpus, I used the Brown Corpus, downloading the data from the Natural Language Toolkit (NLTK) project. The corpus is a set of about 500 files containing one sentence per line, each word manually tagged with a very comprehensive set of POS tags described here. Since I plan to use Wordnet at some point with this data, and Wordnet only categorizes words into one of 4 categories, I set up my own Part of Speech Enum called Pos which has 5 categories: the 4 from Wordnet plus OTHER. As a result, I had to dumb the Brown tags down using the translation table shown below:

BTAG POS BTAG POS BTAG POS BTAG POS
( OTHER FW-CS OTHER MD VERB RBR+CS ADVERB
) OTHER FW-DT OTHER MD* VERB RBT ADVERB
* OTHER FW-DT+BEZ OTHER MD+HV VERB RN ADVERB
, OTHER FW-DTS OTHER MD+PPSS VERB RP ADVERB
-- OTHER FW-HV VERB MD+TO VERB RP+IN ADVERB
. OTHER FW-IN OTHER NN NOUN TO OTHER
: OTHER FW-IN+AT OTHER NN$ NOUN TO+VB VERB
ABL OTHER FW-IN+NN OTHER NN+BEZ NOUN UH OTHER
ABN OTHER FW-IN+NP OTHER NN+HVD NOUN VB VERB
ABX OTHER FW-JJ ADJECTIVE NN+HVZ NOUN VB+AT VERB
AP OTHER FW-JJR ADJECTIVE NN+IN NOUN VB+IN VERB
AP$ OTHER FW-JJT ADJECTIVE NN+MD NOUN VB+JJ VERB
AP+AP OTHER FW-NN NOUN NN+NN NOUN VB+PPO VERB
AT ADJECTIVE FW-NN$ NOUN NNS NOUN VB+RP VERB
BE VERB FW-NNS NOUN NNS$ NOUN VB+TO VERB
BED VERB FW-NP NOUN NNS+MD NOUN VB+VB VERB
BED* VERB FW-NPS NOUN NP NOUN VBD VERB
BEDZ VERB FW-NR NOUN NP$ NOUN VBG VERB
BEDZ* VERB FW-OD NOUN NP+BEZ NOUN VBG+TO VERB
BEG VERB FW-PN OTHER NP+HVZ NOUN VBN VERB
BEM VERB FW-PP$ OTHER NP+MD NOUN VBN+TO VERB
BEM* VERB FW-PPL OTHER NPS NOUN VBZ VERB
BEN VERB FW-PPL+VBZ OTHER NPS$ NOUN WDT OTHER
BER VERB FW-PPO OTHER NR NOUN WDT+BER OTHER
BER* VERB FW-PPO+IN OTHER NR$ NOUN WDT+BER+PP OTHER
BEZ VERB FW-PPS OTHER NR+MD NOUN WDT+BEZ OTHER
BEZ* VERB FW-PPSS OTHER NRS NOUN WDT+DO+PPS OTHER
CC OTHER FW-PPSS+HV OTHER OD NOUN WDT+DOD OTHER
CD NOUN FW-QL OTHER PN OTHER WDT+HVZ OTHER
CD$ NOUN FW-RB ADVERB PN$ OTHER WP$ OTHER
CS OTHER FW-RB+CC ADVERB PN+BEZ OTHER WPO OTHER
DO VERB FW-TO+VB VERB PN+HVD OTHER WPS OTHER
DO* VERB FW-UH OTHER PN+HVZ OTHER WPS+BEZ OTHER
DO+PPSS VERB FW-VB VERB PN+MD OTHER WPS+HVD OTHER
DOD VERB FW-VBD VERB PP$ OTHER WPS+HVZ OTHER
DOD* VERB FW-VBG VERB PP$$ OTHER WPS+MD OTHER
DOZ VERB FW-VBN VERB PPL OTHER WQL OTHER
DOZ* VERB FW-VBZ VERB PPLS OTHER WRB ADVERB
DT OTHER FW-WDT OTHER PPO OTHER WRB+BER ADVERB
DT$ OTHER FW-WPO OTHER PPS OTHER WRB+BEZ ADVERB
DT+BEZ OTHER FW-WPS OTHER PPS+BEZ OTHER WRB+DO ADVERB
DT+MD OTHER HV VERB PPS+HVD OTHER WRB+DOD ADVERB
DTI OTHER HV* VERB PPS+HVZ OTHER WRB+DOD* ADVERB
DTS OTHER HV+TO VERB PPS+MD OTHER WRB+DOZ ADVERB
DTS+BEZ OTHER HVD VERB PPSS OTHER WRB+IN ADVERB
DTX OTHER HVD* VERB PPSS+BEM OTHER WRB+MD ADVERB
EX VERB HVG VERB PPSS+BER OTHER - -
EX+BEZ VERB HVN VERB PPSS+BEZ OTHER - -
EX+HVD VERB HVZ VERB PPSS+BEZ* OTHER - -
EX+HVZ VERB HVZ* VERB PPSS+HV OTHER - -
EX+MD VERB IN OTHER PPSS+HVD OTHER - -
FW-* OTHER IN+IN OTHER PPSS+MD OTHER - -
FW-AT ADJECTIVE IN+PPO OTHER PPSS+VB OTHER - -
FW-AT+NN ADJECTIVE JJ ADJECTIVE QL OTHER - -
FW-AT+NP ADJECTIVE JJ$ ADJECTIVE QLP OTHER - -
FW-BE VERB JJ+JJ ADJECTIVE RB ADVERB - -
FW-BER VERB JJR ADJECTIVE RB$ ADVERB - -
FW-BEZ VERB JJR+CS ADJECTIVE RB+BEZ ADVERB - -
FW-CC OTHER JJS ADJECTIVE RB+CS ADVERB - -
FW-CD NOUN JJT ADJECTIVE RBR ADVERB - -
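
As an aside, this table lives in a tab-separated flat file (src/main/resources/brown_tags.txt) that the fromBrownTag() method shown below reads, with lines starting with # treated as comments. A few representative lines, reconstructed from the table above, would look like this (the two columns separated by a tab):

  # BTAG<tab>POS
  AT	ADJECTIVE
  NN	NOUN
  VB	VERB
  RB	ADVERB
  CC	OTHER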

You may notice that some of the mappings are not correct. Unfortunately, my knowledge of formal English grammar is not as good as I would like it to be, owing to having been educated in an environment that posited that a person can learn to recognize incorrect grammar better by reading and writing enough grammatically correct sentences than through a study of language rules. My grandfather, obviously regarding all this as crazy talk, briefly attempted to rectify that, armed with a Wren and Martin and an 18" ruler, but as you can probably see, it did not work out all that well :-).

The code for the Pos Enum is shown below. As mentioned earlier, it exposes a set of 5 POS values, and has a convenience method to convert a Brown tag into the corresponding Pos.

// Source: src/main/java/com/mycompany/myapp/postaggers/Pos.java
package com.mycompany.myapp.postaggers;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.lang.StringUtils;

/**
 * Enumeration of Parts of Speech being considered. Conversions from
 * Brown Tags and Wordnet tags are handled by convenience methods.
 */
public enum Pos {

  NOUN, VERB, ADJECTIVE, ADVERB, OTHER;

  private static Map<String,Pos> bmap = null;
  private static final String translationFile = 
    "src/main/resources/brown_tags.txt";
  
  public static Pos fromBrownTag(String btag) throws Exception {
    if (bmap == null) {
      bmap = new HashMap<String,Pos>();
      BufferedReader reader = new BufferedReader(new InputStreamReader(
          new FileInputStream(translationFile)));
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.startsWith("#")) {
          continue;
        }
        String[] cols = StringUtils.split(line, "\t");
        bmap.put(StringUtils.lowerCase(cols[0]), Pos.valueOf(cols[1])); 
      }
      reader.close();
    }
    Pos pos = bmap.get(btag);
    if (pos == null) {
      return Pos.OTHER;
    }
    return pos;
  }
}

The BrownCorpusReader reads through each tagged file in the Brown Corpus directory, extracts the word and the tag from each word/tag pair, converts the Brown tag to its equivalent Pos value, and accumulates the occurrences into internal counters. Once all files are processed, the counters are normalized into the three probability matrices Π, A and B that we spoke about earlier. For example, if NOUN is followed by NOUN 100 times and by VERB 300 times (and by nothing else), the NOUN row of A is normalized to (0.25, 0.75, 0, 0, 0).

Since the number of words tagged in any corpus is potentially quite large, we represent each word (or observation) in the HMM as an integer. That is why the BrownCorpusReader also dumps out a list of the unique words it found in the corpus into a flat file, which can be pulled back into memory later to map between a word and its integer observation id.

// Source: src/main/java/com/mycompany/myapp/postaggers/BrownCorpusReader.java
package com.mycompany.myapp.postaggers;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

import org.apache.commons.collections15.Bag;
import org.apache.commons.collections15.bag.HashBag;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import Jama.Matrix;

/**
 * Reads a file or directory of tagged text from Brown Corpus and
 * computes the various probability matrices for the HMM.
 */
public class BrownCorpusReader {

  private final Log log = LogFactory.getLog(getClass());
  
  private String dataFilesLocation;
  private String wordDictionaryLocation;
  private boolean debug;

  private Bag<String> piCounts = new HashBag<String>();
  private Bag<String> aCounts = new HashBag<String>();
  private Map<String,Double[]> wordPosMap = 
    new HashMap<String,Double[]>();
  
  private Matrix pi;
  private Matrix a;
  private Matrix b;
  private List<String> words;
  
  public void setDataFilesLocation(String dataFilesLocation) {
    this.dataFilesLocation = dataFilesLocation;
  }
  
  public void setWordDictionaryLocation(String wordDictionaryLocation) {
    this.wordDictionaryLocation = wordDictionaryLocation;
  }
  
  public void setDebug(boolean debug) {
    this.debug = debug;
  }

  public void read() throws Exception {
    File location = new File(dataFilesLocation);
    File[] inputs;
    if (location.isDirectory()) {
      inputs = location.listFiles();
    } else {
      inputs = new File[] {location};
    }
    int currfile = 0;
    int totfiles = inputs.length;
    for (File input : inputs) {
      currfile++;
      log.info("Processing file (" + currfile + "/" + totfiles + "): " + 
        input.getName());
      BufferedReader reader = new BufferedReader(new InputStreamReader(
        new FileInputStream(input)));
      String line;
      while ((line = reader.readLine()) != null) {
        if (StringUtils.isEmpty(line)) {
          continue;
        }
        StringTokenizer tok = new StringTokenizer(line, " ");
        int wordIndex = 0;
        Pos prevPos = null;
        while (tok.hasMoreTokens()) {
          String taggedWord = tok.nextToken();
          String[] wordTagPair = StringUtils.split(
            StringUtils.lowerCase(StringUtils.trim(taggedWord)), "/");
          if (wordTagPair.length != 2) {
            continue;
          }
          Pos pos = Pos.fromBrownTag(wordTagPair[1]);
          if (! wordPosMap.containsKey(wordTagPair[0])) {
            // create an entry
            Double[] posProbs = new Double[Pos.values().length];
            for (int i = 0; i < posProbs.length; i++) {
              posProbs[i] = new Double(0.0D);
            }
            wordPosMap.put(wordTagPair[0], posProbs);
          }
          Double[] posProbs = wordPosMap.get(wordTagPair[0]);
          posProbs[pos.ordinal()] += 1.0D;
          wordPosMap.put(wordTagPair[0], posProbs);
          if (wordIndex == 0) {
            // first word, update piCounts
            piCounts.add(pos.name());
          } else {
            aCounts.add(StringUtils.join(new String[] {
              prevPos.name(), pos.name()}, ":"));
          }
          prevPos = pos;
          wordIndex++;
        }
      }
      reader.close();
    }
    // normalize counts to probabilities
    int numPos = Pos.values().length;
    // compute pi
    pi = new Matrix(numPos, 1);
    for (int i = 0; i < numPos; i++) {
      pi.set(i, 0, piCounts.getCount((Pos.values()[i]).name()));
    }
    pi = pi.times(1 / pi.norm1());
    // compute a
    a = new Matrix(numPos, numPos);
    for (int i = 0; i < numPos; i++) {
      for (int j = 0; j < numPos; j++) {
        a.set(i, j, aCounts.getCount(StringUtils.join(new String[] {
          (Pos.values()[i]).name(), (Pos.values()[j]).name()
        }, ":")));
      }
    }
    // compute b
    int numWords = wordPosMap.size();
    words = new ArrayList<String>();
    words.addAll(wordPosMap.keySet());
    b = new Matrix(numPos, numWords);
    for (int i = 0; i < numPos; i++) {
      for (int j = 0; j < numWords; j++) {
        String word = words.get(j);
        b.set(i, j, wordPosMap.get(word)[i]);
      }
    }
    // normalize across rows for a and b (sum of cols in each row == 1.0)
    for (int i = 0; i < numPos; i++) {
      double rowSumA = 0.0D;
      for (int j = 0; j < numPos; j++) {
        rowSumA += a.get(i, j);
      }
      for (int j = 0; j < numPos; j++) {
        a.set(i, j, (a.get(i, j) / rowSumA));
      }
      double rowSumB = 0.0D;
      for (int j = 0; j < numWords; j++) {
        rowSumB += b.get(i, j);
      }
      for (int j = 0; j < numWords; j++) {
        b.set(i, j, (b.get(i, j) / rowSumB));
      }
    }
    // write out brown word dictionary for later use
    writeDictionary();
    // debug
    if (debug) {
      pi.print(8, 4);
      a.print(8, 4);
      b.print(8, 4);
      System.out.println(words.toString());
    }
  }
  
  public List<String> getWords() {
    return words;
  }
  
  public double[] getPi() {
    double[] pia = new double[pi.getRowDimension()];
    for (int i = 0; i < pia.length; i++) {
      pia[i] = pi.get(i, 0);
    }
    return pia;
  }
  
  public double[][] getA() {
    double[][] aa = new double[a.getRowDimension()][a.getColumnDimension()];
    for (int i = 0; i < a.getRowDimension(); i++) {
      for (int j = 0; j < a.getColumnDimension(); j++) {
        aa[i][j] = a.get(i, j);
      }
    }
    return aa;
  }
  
  public double[][] getB() {
    double[][] ba = new double[b.getRowDimension()][b.getColumnDimension()];
    for (int i = 0; i < b.getRowDimension(); i++) {
      for (int j = 0; j < b.getColumnDimension(); j++) {
        ba[i][j] = b.get(i, j);
      }
    }
    return ba;
  }

  private void writeDictionary() throws Exception {
    FileWriter dictWriter = new FileWriter(wordDictionaryLocation);
    for (String word : words) {
      dictWriter.write(word + "\n");
    }
    dictWriter.flush();
    dictWriter.close();
  }
}

We generate the HMM and serialize it to disk as a flat file. That decouples the building of the HMM from its actual usage, which saves a few CPU cycles and makes the tests run a bit faster. In addition, if this solution were to be used in a real-life situation, it would be much faster to load the HMM from a flat file than to build it from a tagged corpus. Our serialized HMM file looks like this (edited to truncate the number of observations for readability).

On a quick side note, the Jahmm example uses the ObservationDiscrete class based on an Enum to model a small finite set of observations. This works well if the number of observations in your set is small and well known. In our case, we consider each unique word an observation, and we have approximately 3900 of them, so we used the ObservationInteger class to model the observations; our flat file serves as the mapping between the integer id for an Observation and the actual word.
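
To illustrate the difference, here is a rough sketch based on my reading of the Jahmm API (the Packet enum here is hypothetical):

  // small, closed observation set: wrap an Enum value directly
  enum Packet { ACK, SYN, DATA };
  ObservationDiscrete<Packet> discreteObs =
    new ObservationDiscrete<Packet>(Packet.ACK);

  // large, open-ended set (our ~3900 words): look up the word's integer
  // id in the dictionary flat file and wrap that instead
  int wordId = words.get("salmon");  // words is the Map<String,Integer>
                                     // loaded from the dictionary file
  ObservationInteger integerObs = new ObservationInteger(wordId);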

Hmm v1.0

NbStates 5

State
Pi 0.127
A 0.155 0.156 0.019 0.025 0.645 
IntegerOPDF [0 0 0.00002 0.00003 0 0 0.00001 0 0.00001 ...]

State
Pi 0.057
A 0.095 0.195 0.168 0.094 0.449 
IntegerOPDF [0 0 0 0 0 0.00001 0 0 0 0 0 0 0.00001 0.00005 ...]

State
Pi 0.164
A 0.639 0.024 0.148 0.005 0.183 
IntegerOPDF [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...]

State
Pi 0.083
A 0.052 0.228 0.111 0.041 0.569 
IntegerOPDF [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...]

State
Pi 0.569
A 0.206 0.199 0.205 0.039 0.351 
IntegerOPDF [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0032 0.00033 0.00064 ...]

The following JUnit test snippet shows how we use the HmmTagger class (described below) to call the BrownCorpusReader and build and persist the HMM.

  @Test
  public void testBuildFromBrownAndWrite() throws Exception {
    HmmTagger hmmTagger = new HmmTagger();
    hmmTagger.setDataDir("/opt/brown-2.0");
    hmmTagger.setDictionaryLocation("src/test/resources/brown_dict.txt");
    hmmTagger.setHmmFileName("src/test/resources/hmm_tagger.dat");
    Hmm<ObservationInteger> hmm = hmmTagger.buildFromBrownCorpus();
    hmmTagger.saveToFile(hmm);
  }

HMM Tagger class

I then create an HmmTagger class that can build an HMM from the BrownCorpusReader as well as from a serialized HMM file like the one shown above. The HmmTagger contains all the methods needed to solve the common HMM problems listed above. The code for the HmmTagger is as follows:

// Source: src/main/java/com/mycompany/myapp/postaggers/HmmTagger.java
package com.mycompany.myapp.postaggers;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.text.DecimalFormat;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import be.ac.ulg.montefiore.run.jahmm.ForwardBackwardCalculator;
import be.ac.ulg.montefiore.run.jahmm.Hmm;
import be.ac.ulg.montefiore.run.jahmm.ObservationInteger;
import be.ac.ulg.montefiore.run.jahmm.OpdfInteger;
import be.ac.ulg.montefiore.run.jahmm.OpdfIntegerFactory;
import be.ac.ulg.montefiore.run.jahmm.ViterbiCalculator;
import be.ac.ulg.montefiore.run.jahmm.io.HmmReader;
import be.ac.ulg.montefiore.run.jahmm.io.HmmWriter;
import be.ac.ulg.montefiore.run.jahmm.io.OpdfIntegerReader;
import be.ac.ulg.montefiore.run.jahmm.io.OpdfReader;
import be.ac.ulg.montefiore.run.jahmm.io.OpdfWriter;
import be.ac.ulg.montefiore.run.jahmm.learn.BaumWelchLearner;
import be.ac.ulg.montefiore.run.jahmm.learn.KMeansLearner;
import be.ac.ulg.montefiore.run.jahmm.toolbox.KullbackLeiblerDistanceCalculator;

/**
 * HMM based POS Tagger.
 */
public class HmmTagger {

  private static final DecimalFormat OBS_FORMAT = 
    new DecimalFormat("##.#####");
  
  private final Log log = LogFactory.getLog(getClass());

  private String dataDir;
  private String dictionaryLocation;
  private String hmmFileName;
  
  private Map<String,Integer> words = 
    new HashMap<String,Integer>();
  
  public void setDataDir(String brownDataDir) {
    this.dataDir = brownDataDir;
  }

  public void setDictionaryLocation(String dictionaryLocation) {
    this.dictionaryLocation = dictionaryLocation;
  }

  public void setHmmFileName(String hmmFileName) {
    this.hmmFileName = hmmFileName;
  }

  /**
   * Builds up an HMM where states are parts of speech given by the Pos
   * Enum, and the observations are actual words in the tagged Brown
   * corpus. Each integer observation corresponds to the position of 
   * a word found in the Brown corpus.
   * @return an HMM.
   * @throws Exception if one is thrown.
   */
  public Hmm<ObservationInteger> buildFromBrownCorpus() 
      throws Exception {
    BrownCorpusReader brownReader = new BrownCorpusReader();
    brownReader.setDataFilesLocation(dataDir);
    brownReader.setWordDictionaryLocation(dictionaryLocation);
    brownReader.read();
    int nbStates = Pos.values().length;
    OpdfIntegerFactory factory = new OpdfIntegerFactory(nbStates);
    Hmm<ObservationInteger> hmm = 
      new Hmm<ObservationInteger>(nbStates, factory); 
    double[] pi = brownReader.getPi();
    for (int i = 0; i < nbStates; i++) {
      hmm.setPi(i, pi[i]);
    }
    double[][] a = brownReader.getA();
    for (int i = 0; i < nbStates; i++) {
      for (int j = 0; j < nbStates; j++) {
        hmm.setAij(i, j, a[i][j]);
      }
    }
    double[][] b = brownReader.getB();
    // the opdf for a state is set once from its full emission row,
    // so a single loop over the states is sufficient here
    for (int i = 0; i < nbStates; i++) {
      hmm.setOpdf(i, new OpdfInteger(b[i]));
    }
    int seq = 0;
    for (String word : brownReader.getWords()) {
      words.put(word, seq);
      seq++;
    }
    return hmm;
  }
  
  /**
   * Builds an HMM from a formatted file describing the HMM. The format is
   * specified by the Jahmm project, and it has utility methods to read and
   * write HMMs from and to text files. We use this because the builder that
   * builds an HMM from the Brown corpus is computationally intensive and
   * this strategy provides us a way to partition the process.
   * @return a HMM
   * @throws Exception if one is thrown.
   */
  public Hmm<ObservationInteger> buildFromHmmFile() throws Exception {
    File hmmFile = new File(hmmFileName);
    if (! hmmFile.exists()) {
      throw new Exception("HMM File: " + hmmFile.getName() + 
        " does not exist");
    }
    FileReader fileReader = new FileReader(hmmFile);
    OpdfReader<OpdfInteger> opdfReader = new OpdfIntegerReader();
    Hmm<ObservationInteger> hmm = 
      HmmReader.read(fileReader, opdfReader);
    return hmm;
  }
  
  /**
   * Utility method to save an HMM into a formatted text file describing the
   * HMM. The format is specified by the Jahmm project, which also provides
   * utility methods to write a HMM to the text file.
   * @param hmm the HMM to write.
   * @throws Exception if one is thrown.
   */
  public void saveToFile(Hmm<ObservationInteger> hmm) 
      throws Exception {
    FileWriter fileWriter = new FileWriter(hmmFileName);
    // we create our own impl of the OpdfIntegerWriter because we want
    // to control the formatting of the opdf probabilities. With the 
    // default OpdfIntegerWriter, small probabilities get written in 
    // the exponential format, ie 1.234..E-4, which the HmmReader does
    // not recognize.
    OpdfWriter<OpdfInteger> opdfWriter = 
      new OpdfWriter<OpdfInteger>() {
        @Override
        public void write(Writer writer, OpdfInteger opdf) 
            throws IOException {
          String s = "IntegerOPDF [";
          for (int i = 0; i < opdf.nbEntries(); i++) {
            s += OBS_FORMAT.format(opdf.probability(
              new ObservationInteger(i))) + " ";
          }
          writer.write(s + "]\n");
        }
    };
    HmmWriter.write(fileWriter, opdfWriter, hmm);
    fileWriter.flush();
    fileWriter.close();
  }

  /**
   * Given the HMM, returns the probability of observing the sequence 
   * of words specified in the sentence. Uses the Forward-Backward 
   * algorithm to compute the probability.
   * @param sentence the sentence to check.
   * @param hmm a reference to a prebuilt HMM.
   * @return the probability of observing this sequence.
   * @throws Exception if one is thrown.
   */
  public double getObservationProbability(String sentence, 
      Hmm<ObservationInteger> hmm) throws Exception {
    String[] tokens = tokenizeSentence(sentence);
    List<ObservationInteger> observations = getObservations(tokens);
    ForwardBackwardCalculator fbc = 
      new ForwardBackwardCalculator(observations, hmm);
    return fbc.probability();
  }

  /**
   * Given an HMM and an untagged sentence, tags each word with the part of
   * speech it is most likely to belong in. Uses the Viterbi algorithm.
   * @param sentence the sentence to tag.
   * @param hmm the HMM to use.
   * @return a tagged sentence.
   * @throws Exception if one is thrown.
   */
  public String tagSentence(String sentence, 
      Hmm<ObservationInteger> hmm) throws Exception {
    String[] tokens = tokenizeSentence(sentence);
    List<ObservationInteger> observations = getObservations(tokens);
    ViterbiCalculator vc = new ViterbiCalculator(observations, hmm);
    int[] ids = vc.stateSequence();
    StringBuilder tagBuilder = new StringBuilder();
    for (int i = 0; i < ids.length; i++) {
      tagBuilder.append(tokens[i]).
        append("/").
        append((Pos.values()[ids[i]]).name()).
        append(" ");
    }
    return tagBuilder.toString();
  }
  
  /**
   * Given an HMM, a sentence and a word within the sentence which needs to 
   * be disambiguated, returns the most likely Pos for the specified word.
   * @param word the word to find the Pos for.
   * @param sentence the sentence.
   * @param hmm the HMM.
   * @return the most likely POS.
   * @throws Exception if one is thrown.
   */
  public Pos getMostLikelyPos(String word, String sentence, 
      Hmm<ObservationInteger> hmm) throws Exception {
    if (words == null || words.size() == 0) {
      loadWordsFromDictionary();
    }
    String[] tokens = tokenizeSentence(sentence);
    List<ObservationInteger> observations = getObservations(tokens);
    int wordPos = -1;
    for (int i = 0; i < tokens.length; i++) {
      if (tokens[i].equalsIgnoreCase(word)) {
        wordPos = i;
        break;
      }
    }
    if (wordPos == -1) {
      throw new IllegalArgumentException("Word [" + word + 
        "] does not exist in sentence [" + sentence + "]");
    }
    ViterbiCalculator vc = new ViterbiCalculator(observations, hmm);
    int[] ids = vc.stateSequence();
    return Pos.values()[ids[wordPos]];
  }

  /**
   * Given an existing HMM, this method will send in a List of sentences from
   * a possibly different untagged source, to refine the HMM.
   * @param sentences the List of sentences to teach.
   * @return a HMM that has been taught using the observation sequences.
   * @throws Exception if one is thrown.
   */
  public Hmm<ObservationInteger> teach(List<String> sentences)
      throws Exception {
    if (words == null || words.size() == 0) {
      loadWordsFromDictionary();
    }
    OpdfIntegerFactory factory = new OpdfIntegerFactory(words.size());
    List<List<ObservationInteger>> sequences = 
      new ArrayList<List<ObservationInteger>>();
    for (String sentence : sentences) {
      List<ObservationInteger> sequence = 
        getObservations(tokenizeSentence(sentence));
      sequences.add(sequence);
    }
    KMeansLearner<ObservationInteger> kml = 
      new KMeansLearner<ObservationInteger>(
      Pos.values().length, factory, sequences);
    Hmm<ObservationInteger> hmm = kml.iterate();
    // refine it with Baum-Welch Learner
    BaumWelchLearner bwl = new BaumWelchLearner();
    Hmm<ObservationInteger> refinedHmm = bwl.iterate(hmm, sequences);
    return refinedHmm;
  }
  
  /**
   * Convenience method to compute the distance between two HMMs. This can 
   * be used to stop the teaching process once more teaching is not
   * producing any appreciable improvement in the HMM, ie, the HMM
   * converges. The caller will need to match the result of this method 
   * with a number based on experience.
   * @param hmm1 the original HMM.
   * @param hmm2 the HMM that was most recently taught.
   * @return the difference measure between the two HMMs.
   * @throws Exception if one is thrown.
   */
  public double difference(Hmm<ObservationInteger> hmm1,
      Hmm<ObservationInteger> hmm2) throws Exception {
    KullbackLeiblerDistanceCalculator kdc = 
      new KullbackLeiblerDistanceCalculator();
    return kdc.distance(hmm1, hmm2);
  }
  
  private String[] tokenizeSentence(String sentence) {
    String[] tokens = StringUtils.split(
      StringUtils.lowerCase(StringUtils.trim(sentence)), " ");
    return tokens;
  }
  
  private List<ObservationInteger> getObservations(String[] tokens)
      throws Exception {
    if (words == null || words.size() == 0) {
      loadWordsFromDictionary();
    }
    List<ObservationInteger> observations = 
      new ArrayList<ObservationInteger>();
    for (String token : tokens) {
      Integer wordId = words.get(token);
      if (wordId == null) {
        // the HMM can only handle words seen in the training corpus;
        // fail with a clear message rather than an NPE
        throw new IllegalArgumentException("Word [" + token + 
          "] not found in dictionary");
      }
      observations.add(new ObservationInteger(wordId));
    }
    return observations;
  }
  
  private void loadWordsFromDictionary() throws Exception {
    BufferedReader reader = new BufferedReader(
      new FileReader(dictionaryLocation));
    String word;
    int seq = 0;
    while ((word = reader.readLine()) != null) {
      words.put(word, seq);
      seq++;
    }
    reader.close();
  }
}

Word Sense Disambiguation

Given a sentence, a human can figure out the correct POS for each word almost immediately, but with an HMM, we can only tell which is the most likely POS for the word given the sequence of words in the sentence. Obviously, this depends on how large and accurate the HMM's training data set was. Here is how the HmmTagger is called to determine the most likely POS for a word in the sentence.

  @Test
  public void testWordSenseDisambiguation() throws Exception {
    HmmTagger hmmTagger = new HmmTagger();
    hmmTagger.setDataDir("/opt/brown-2.0");
    hmmTagger.setDictionaryLocation("src/test/resources/brown_dict.txt");
    hmmTagger.setHmmFileName("src/test/resources/hmm_tagger.dat");
    Hmm<ObservationInteger> hmm = 
      hmmTagger.buildFromHmmFile();
    String[] testSentences = new String[] {
      "The dog ran after the cat .",
      "Failure dogs his path .",
      "The cold steel cuts through the flesh .",
      "He had a bad cold .",
      "He will catch the ball .",
      "Salmon is the catch of the day ."
    };
    String[] testWords = new String[] {
      "dog",
      "dogs",
      "cold",
      "cold",
      "catch",
      "catch"
    };
    for (int i = 0; i < testSentences.length; i++) {
      System.out.println("Original sentence: " + testSentences[i]);
      Pos wordPos = hmmTagger.getMostLikelyPos(testWords[i], 
        testSentences[i], hmm); 
      System.out.println("Pos(" + testWords[i] + ")=" + wordPos);
    }
  }

And here are the results. As you can see, the HMM did well on all but the second sentence.

Original sentence: The dog ran after the cat .
Pos(dog)=NOUN

Original sentence: Failure dogs his path .
Pos(dogs)=NOUN

Original sentence: The cold steel cuts through the flesh .
Pos(cold)=ADJECTIVE

Original sentence: He had a bad cold .
Pos(cold)=NOUN

Original sentence: He will catch the ball .
Pos(catch)=VERB

Original sentence: Salmon is the catch of the day .
Pos(catch)=NOUN

POS Tagging

POS Tagging uses the same algorithm as Word Sense Disambiguation. Given an HMM trained with a sufficiently large and accurate corpus of tagged words, we can now use it to automatically tag sentences from a similar corpus. Here is the JUnit code snippet to tag the sentences we used in our previous test.

  @Test
  public void testPosTagging() throws Exception {
    HmmTagger hmmTagger = new HmmTagger();
    hmmTagger.setDataDir("/opt/brown-2.0");
    hmmTagger.setDictionaryLocation("src/test/resources/brown_dict.txt");
    hmmTagger.setHmmFileName("src/test/resources/hmm_tagger.dat");
    Hmm<ObservationInteger> hmm = hmmTagger.buildFromHmmFile();
    // POS tagging
    String[] testSentences = new String[] {...};
    for (int i = 0; i < testSentences.length; i++) {
      System.out.println("Original sentence: " + testSentences[i]);
      System.out.println("Tagged sentence: " + 
        hmmTagger.tagSentence(testSentences[i], hmm));
    }
  }

And here are the results.

Original sentence: The dog ran after the cat .
Tagged sentence: the/ADJECTIVE dog/NOUN ran/VERB after/OTHER 
  the/ADJECTIVE cat/NOUN ./OTHER 

Original sentence: Failure dogs his path .
Tagged sentence: failure/NOUN dogs/NOUN his/OTHER path/NOUN ./OTHER 

Original sentence: The cold steel cuts through the flesh .
Tagged sentence: the/ADJECTIVE cold/ADJECTIVE steel/NOUN cuts/NOUN 
  through/OTHER the/ADJECTIVE flesh/NOUN ./OTHER 

Original sentence: He had a bad cold .
Tagged sentence: he/OTHER had/VERB a/ADJECTIVE bad/ADJECTIVE cold/NOUN 
  ./OTHER 

Original sentence: He will catch the ball .
Tagged sentence: he/OTHER will/VERB catch/VERB the/ADJECTIVE ball/NOUN 
  ./OTHER 

Original sentence: Salmon is the catch of the day .
Tagged sentence: salmon/NOUN is/VERB the/ADJECTIVE catch/NOUN of/OTHER 
  the/ADJECTIVE day/NOUN ./OTHER 

Sentence Likelihood

HMMs can be used to predict whether one sentence is more likely to occur than another, by comparing the observation probabilities of the two word sequences. For example, we find that the HMM believes that sentences spoken by Master Yoda of Star Wars fame are less likely to occur in "normal" English than sentences expressing similar meanings that you or I would speak.

  @Test
  public void testObservationProbability() throws Exception {
    HmmTagger hmmTagger = new HmmTagger();
    hmmTagger.setDataDir("/opt/brown-2.0");
    hmmTagger.setDictionaryLocation("src/test/resources/brown_dict.txt");
    hmmTagger.setHmmFileName("src/test/resources/hmm_tagger.dat");
    Hmm<ObservationInteger> hmm = hmmTagger.buildFromHmmFile();
    System.out.println("P(I am worried)=" + 
      hmmTagger.getObservationProbability("I am worried", hmm));
    System.out.println("P(Worried I am)=" +  
      hmmTagger.getObservationProbability("Worried I am", hmm));
  }

As expected, our results indicate that the HMM understands us better than it understands Master Yoda.

P(I am worried)=5.446081633660202E-11
P(Worried I am)=1.2623833954125002E-11

Unsupervised Learning

The final problem we can solve with an HMM is to build one from a set of untagged data. This HMM can then be used for solving the Sentence Likelihood problem, but not the POS Tagging or the WSD problems. To set this up, I picked up a bunch of Yoda quotes from this page and fed them into a newly instantiated HMM. I then took the same two sentences and asked the HMM which was more probable. Here is the test code snippet to do that:

  @Test
  public void testTeachYodaAndObserveProbabilities() throws Exception {
    List<String> sentences = Arrays.asList(new String[] {
      "Powerful you have become .",
      "The dark side I sense in you .",
      "Grave danger you are in .",
      "Impatient you are .",
      "Try not .",
      "Do or do not there is no try .",
      "Consume you the dark path will .",
      "Always in motion is the future .",
      "Around the survivors a perimeter create .",
      "Size matters not .",
      "Blind we are if creation of this army we could not see .",
      "Help you I can yes .",
      "Strong am I with the force .",
      "Agree with you the council does .",
      "Your apprentice he will be .",
      "Worried I am .",
      "Always two there are .",
      "When 900 years you reach look as good you will not ."
    });
    HmmTagger hmmTagger = new HmmTagger();
    hmmTagger.setDataDir("/opt/brown-2.0");
    hmmTagger.setDictionaryLocation("src/test/resources/brown_dict.txt");
    Hmm<ObservationInteger> learnedHmm = hmmTagger.teach(sentences);
    System.out.println("P(Worried I am)=" +  
      hmmTagger.getObservationProbability("Worried I am", learnedHmm));
    System.out.println("P(I am worried)=" + 
      hmmTagger.getObservationProbability("I am worried", learnedHmm));
  }

Now, as you can see, this new HMM understands Yoda better than it understands us :-).

P(Worried I am)=4.455273233553778E-6
P(I am worried)=2.4569521508568634E-6

Conclusions

Personally, the learning curve was quite steep for me. The theory was fairly easy to grasp from an intuitive standpoint, but understanding how to model the POS tagging problem as an HMM took me a while. Once I crossed that hurdle, it took me a fair bit of effort to figure out how to use Jahmm to build and solve an HMM.

I think it was worth it, though. HMMs are a very powerful modeling tool for text mining, and can be used to model a variety of real life situations. Using a library such as Jahmm means that you just have to figure out how to model your problem and to solve it using the tools provided.

Hopefully, if you've read this far, and you started out not knowing (or with only a vague idea of) what an HMM was and how it could be used for POS tagging (as was my situation a couple of months ago), this post has provided some information as well as an example of using the Jahmm API to build and solve an HMM.

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.

44 comments (moderated to prevent spam):

Anonymous said...

great application, well explained.
I am student in France in computing and language analysis.

I understand your program but i don't really see how are files like brown_tags.txt or brown_dict.txt ...

I would like see an example if it's possible.

Thanks for all.

Sujit Pal said...

Hi, thanks for the kind words, and I am glad you liked the post. The brown_tags.txt is basically a flat file version of the table which I got from the web page referenced, and the brown_dict.txt is the dictionary that I got off the NLTK project.

Kishor Shamra said...

Hi Paul ,
Great application .......I was also trying to use the application but i could n't found brown_dict.txt file anywhere can u please help me??

Kishor Shamra said...

gr8 explanation!!!!!! very good blog, I was trying to develop the same application but can't find the file brown_dict.txt brown_tags.txt
can u please help me on this??

Sujit Pal said...

Hi sharma, thanks for your appreciation. I got the brown_dict.txt file from the NLTK Project, and the brown_tags.txt is a flat file version of the table in this page.

Anonymous said...

Hello,I am a student from China,I wanna say it's a great application,and now i am trying to use the application,but there is an error,i can't find the class POS,could you help me?
Thank you!

Sujit Pal said...

Thank you. Are you trying out the code in this post? I did not see any reference to the POS class in the code here. I have used the POS class elsewhere, its from the MIT Java-Wordnet Interface jar file (jwi-*.jar). Google for the full name and you will find the project page and thence the jar file.

allanctan said...

Hi Sujit,

Just noticed in HMMTagger.java line#92-#97... the second loop (j) is not necessary?

Sujit Pal said...

Thanks, you are right...removing this in the svn version and checking in.

Anonymous said...

Hi. I'm also using Jahmm but I'm encountering a problem. I want to make a voice recognition
software and use HMM to classify the data.

An example of the feature vector(training data) I used is like this:

739.095423 -68.9217791 -41.94023848 190.343314 70.14624796
19.4320868 101.1893881 60.00366733 1.083977616 31.19825832
44.07300049 19.78601438

To learn the HMMs, my code is:

public Hmm<ObservationReal> learnBleh(
    List<List<ObservationReal>> sequences) {
  int numberOfHiddenStates = 12;
  do {
    KMeansLearner<ObservationReal> kml =
      new KMeansLearner<ObservationReal>(numberOfHiddenStates,
        new OpdfGaussianFactory(), sequences);
    trainedHmm = kml.learn();
    BaumWelchLearner bwl = new BaumWelchLearner();
    bwl.setNbIterations(20);
    trainedHmm = bwl.learn(trainedHmm, sequences);
    numberOfHiddenStates++;
  } while (Double.isNaN(trainedHmm.getPi(0)) &&
    numberOfHiddenStates < 50);

  return trainedHmm;
}

How come I get an error:
Exception in thread "main" java.lang.IllegalArgumentException: Variance must be positive
  at be.ac.ulg.montefiore.run.distributions.GaussianDistribution.<init>(GaussianDistribution.java:43)
  at be.ac.ulg.montefiore.run.jahmm.OpdfGaussian.fit(OpdfGaussian.java:124)
  at be.ac.ulg.montefiore.run.jahmm.OpdfGaussian.fit(OpdfGaussian.java:93)
  at be.ac.ulg.montefiore.run.jahmm.learn.KMeansLearner.learnOpdf(KMeansLearner.java:165)
  at be.ac.ulg.montefiore.run.jahmm.learn.KMeansLearner.iterate(KMeansLearner.java:67)
  at be.ac.ulg.montefiore.run.jahmm.learn.KMeansLearner.learn(KMeansLearner.java:96)

I don't know what the error means. Have you ever encountered an error like this? Thanks.
All help is greatly appreciated!

Ashwin Jayaprakash said...

Hi Sujit, I got to read and appreciate this old post only today. Nice work, especially the "Yoda speak" example :)

BTW, are you following the Mahout project? It looks like something you'd enjoy.

Ashwin.

Sujit Pal said...

@Anonymous: I think I have seen this error before, but not sure about the solution anymore, this was a while ago. I will try to find out what I did or if I can reproduce this, and get back.

@Ashwin: Thanks, I /am/ following the Mahout project, sort of. Last I looked it was too early stage unless one is a developer on that project, but I know that there has been some activity recently - I should probably look at it again. Thanks for the reminder.

Zoloo said...

Hello Sujit. It is very helpful to me. But I have a question? Which is it bi-gram or tri-gram.

Thanks.

Sujit Pal said...

Thanks Zoloo. Not sure about the question though?

Barnabas said...

Hi, Sujit. Thanks for your amazingly good explanation ^^ it was really helpful to me.

I'm now focusing on Named Entity Recognition problem, and I want to use HMM (with Jahmm library) to solve this problem. I made training corpus which is Named Entity marked by hand, so that I can use it as a training data. But the thing is that, how to train it automatically.
I've looking for training method in Jahmm, but there are only 2 way to train, BaumWelchLearner and KMeansLearner, and they only gets Observation Sequences as a parameter.
For me, this is the problem. Because, for the training, I want to mark Named Entity Type to the words when process the training.

This is long and boring question, sorry about this.;;

In summary, can I train to make all Hmm variables (pi, transition prob. and observation prob. ) automatically using internal method of Jahmm ? if not, how can I train them without making this Hmm variables with my own hand.

Thanks for all, in advance. ^^;

Sujit Pal said...

Thanks Barnabas. Unfortunately I don't think I know enough to answer your question. From what I understand about HMMs, you need a set of observation sequences to train it, but I probably have a lot to learn about this stuff. I would suggest asking the author - he has put his code on googlecode, and that has a discussion list as well.

Unknown said...

Thanks for your comment, Sujit.
I've found another good library, mallet (http://mallet.cs.umass.edu/) it supports supervised learning of HMM, CRF, MEMM. By using this library, I will implement NER system. If there's some progress, I'll let you know, thanks ^^

Sujit Pal said...

Hi Barnabas, best of luck, and thanks for the pointer to mallet, definitely want to take a look at it.

Shruti said...

Hey Sujit,
Great article! I was just wondering what you considered as the initial probabilities in the HMM. Do we consider random numbers that add up to 1 or is there some other way to set initial probabilities?

Sujit Pal said...

Thanks Shruti. The initial probabilities were calculated based on the actual occurrences of a word and the POS in the tagged Brown corpus.

Shantanu said...

Nice application. But i can't set the brown corpus...Please help me to build it for bengali langage..

Sujit Pal said...

Hi Shantanu, you got the wrong guy, sorry :-). Even though I am Bengali, I can't read or write the language (well not completely true, but annotating bengali text with grammar is way past my capabilities), spent most of my life outside Bengal.

Swati said...

hi Sujeet...i tried working with the tagger.but when i give my own input file it gives an error.Does it not work with unknown words not present in the dictionary?

Sujit Pal said...

Hi Swati, yes, I believe you need the dictionary to contain one or more occurrences of the words you are seeking to tag. It figures out the POS of such words by looking at the context of these words in your corpus and finding the most likely one.

Deuz said...

Hi Sujit,

Nice blog you have here. Do the source code is still available for download?

Thanks

Sujit Pal said...

Thanks Deuz, and yes, most of the code from this time is checked into jtmt.sf.net. If you don't find it there its probably not available, but you can copy paste the code from the blog post if it helps (line numbers are not included in the paste).

unni said...

sir how can i run your code?

Sujit Pal said...

Hi unni, the post contains JUnit test snippets (search for @Test in this page) that show how to build an HMM model from the Brown corpus, use the model to detect POS for words, and rank sentences in order of plausibility. You can either build a JUnit test class with these, or put the code inside a main() method and execute them.

Anonymous said...

Sir thanks for this great app and tutorial. I wonder what should be the content of hmm_tagger.dat file.

Sujit Pal said...

Hi, I put them all in my JTMT project on sourceforge (link to project at bottom of post) - the hmm_tagger.dat file I used is here. Deleting your other two comments since they are the same thing, sorry about the delay in replying.

Anonymous said...

I tried running your code. but i always get error. Am i doing something wrong? it always fail on the observationpart

Anonymous said...

how do you add words in the brown_dict.txt ??

Sujit Pal said...

Its been a while since I wrote this so can't promise that I will be able to figure it out. Do you get an error message? Also, you may want to download the code from the jtmt.sf.net project SVN repo - they should be the same as whats on the post, but just to be sure.

I think the brown_dict.txt is just the words from the brown_tags.txt file. I built it because JAHMM needed it as a list of words.

Anonymous said...

I did try to add words on the brown_dict file but everything got messed up. Do you add words manually or is there a way to add them using jahmm?

Anonymous said...

how did you create the tagger.dat file ?

Sujit Pal said...

I don't recall, but I am almost certain that I didn't, since the most likely thing I would have done would be to transform the Brown Corpus data into the format JAHMM or JAHMM based code requires. The tagger file is created using HmmTagger, look at the saveToFile method. For how its called take a look at HmmTaggerTest#testBuildFromBrownAndWrite. Take a look at the code, its reasonably well commented, and you should find all your answers in there - its all in the postaggers (src/main/java and src/test/java) package.

Unknown said...

Hi, I liked the idea.
But what if I want to tag any input sentence with more than five tagsets(NOUN, VERB, ADJECTIVE, ADVERB, OTHER)?
Thanks

Sujit Pal said...

Hi Vikram, you will need training data for whatever you want to tag with. In my case the training data comes from the brown corpus which is tagged with the 5 POS tags, so thats why the classification happens only with 5 tags. If you want to tag with more POS tags, you will have to get a training set that has more tags - Penn Treebank has more, but the full corpus is not free.

Unknown said...

Hello, sujit sir
Where can i get documentation of JHMM lib.
I did not find any resources except this.
Thanks

Sujit Pal said...

Hi Vikram, its been a while so don't remember exactly, but I believe I initially used the JAHMM example here then started looking through the source code, specifically the unit tests to see how to do various things.

Anonymous said...

How can I find the unknown word i.e the word which is not in the corpus, are there any algorithms that I can refer or anything else.

Thanks!

Sujit Pal said...

Do you mean predict the probability of an unknown word? There is nothing directly applicable AFAIK, but you could probably use smoothing techniques to include _UNK_ (representative token for unknown words) and then find the probability of _UNK_. But not completely sure if this will work for you.

Lalise said...

I got a problem when I was trying to use a corpus which is tagged (POS) to work on WSD using java. The code perform well when I use untagged corpus. I tried to change the format of the data, by putting '_' , '/' to split the tag name from the word . What should I do now?

Sujit Pal said...

First off, apologies for the delayed answer. In case you haven't figured this out yourself or found another way to solve your problem by now, fix is to replace "/" in BrownCorpusReader.java line 84 to "_".