Wednesday, April 30, 2014

Finding Temporality of Sentences with TempoWordNet


In my previous post, I described a Scala implementation of the Negex algorithm, which was initially developed to detect negation of phrases in a sentence using manually built pre- and post-trigger terms. It was later found that the same algorithm could also detect the temporality and experiencer characteristics of a phrase in a sentence, using different sets of pre- and post-trigger terms.

I also recently became aware of the TempoWordNet project from a group at Normandie University. The project provides a free lexical resource in which each synset of WordNet is marked up with its probability of being past, present, future or atemporal. This paper describes the process by which these probabilities were generated. There is another paper that one of the authors referenced on LinkedIn, but it's unfortunately behind an ACM paywall, and since I am no longer a member as of this year, I could not read it.

In running the Negex algorithm against the annotated list that comes with it, I found that the Historical and Hypothetical annotations had lower accuracies (0.90 and 0.89, respectively) compared to the other two (Negation and Experiencer). Thinking about it, I realized that the Historical and Hypothetical annotators are a pair of binary annotators used together to classify a phrase into one of three classes - Historical, Recent and Not Particular. With this understanding, some small changes in how I measured the accuracy brought them up to 0.93 and 0.99, respectively. But I figured it may also be possible to compute temporality using the TempoWordNet file, similar to how one does classical sentiment analysis. This post describes that work.

Each synset in the TempoWordNet file is identified by a triple of word, part of speech and synset ID. From this, I build a LingPipe ExactDictionaryChunker over word/POS pairs for each temporality state. I make sure to capture only the first synset ID for each word/POS combination, so hopefully I capture the probabilities for the most dominant (first) synset. I have written earlier about the LingPipe ExactDictionaryChunker; it implements the Aho-Corasick string matching algorithm, which is very fast and space efficient.
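To make the parsing concrete, here is a made-up line in the general shape the code below expects (the real file's columns differ in their details; what matters is the synset identifier in the second column and the four probabilities - past, present, future, atemporal - at the end):

00044455  early.a.01  (other columns)  0.81  0.10  0.05  0.04

The synset identifier "early.a.01" is turned by getWordPos into the dictionary key "early/a" (the satellite adjective POS "s" is also normalized to "a"), and that key is added to each tense dictionary as a DictionaryEntry whose score is the corresponding probability.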

Each sentence is tokenized into words and POS tagged (using OpenNLP's POS tagger). Each word/POS combination is matched against each of the ExactDictionaryChunkers, and the probabilities of the matching words for each tense are summed across all words. The class for which the sum of the individual word probabilities is highest is the class of the sentence. Since OpenNLP uses the Penn Treebank tagset, the tags need to be translated to WordNet tags before matching.
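As a hand-worked illustration (constructed by hand, not actual output from the code): OpenNLP might tag the sentence "He broke his arm" as He/PRP broke/VBD his/PRP$ arm/NN, which gets rewritten with WordNet tags as "He/o broke/v his/o arm/n" before matching. Each tense chunker then matches its dictionary against this string, and the sentence-level score for a tense is

  score(tense) = sum of P(tense | word/POS) over all matched words

with the predicted class being the tense with the highest score, defaulting to "Present" when no words match.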

Here is the code for the TemporalAnnotator.

// Source: src/main/scala/com/mycompany/scalcium/transformers/TemporalAnnotator.scala
package com.mycompany.scalcium.transformers

import java.io.File
import java.util.regex.Pattern

import scala.Array.canBuildFrom
import scala.collection.JavaConversions.asScalaIterator
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

import com.aliasi.chunk.Chunker
import com.aliasi.dict.DictionaryEntry
import com.aliasi.dict.ExactDictionaryChunker
import com.aliasi.dict.MapDictionary
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory
import com.mycompany.scalcium.utils.Tokenizer

class TemporalAnnotator(val tempoWNFile: File) {

  val targets = List("Past", "Present", "Future")
  val chunkers = buildChunkers(tempoWNFile)
  val tokenizer = Tokenizer.getTokenizer("opennlp")
  
  // Penn Treebank tag prefixes and the corresponding WordNet POS tags.
  // Note: getWordPos() normalizes the satellite adjective POS "s" to "a"
  // when building the dictionaries, so adjectives (JJ) must map to "a"
  // here as well, otherwise they would never match a dictionary entry.
  val ptPoss = List("JJ", "NN", "VB", "RB")
    .map(p => Pattern.compile(p + ".*"))
  val wnPoss = List("a", "n", "v", "r")
  
  def predict(sentence: String): String = {
    // POS tag the sentence and rewrite it as "word/pos" tokens with
    // WordNet POS tags, eg "He broke his arm" => "He/o broke/v his/o arm/n"
    val taggedSentence = tokenizer.posTag(sentence)
      .map(wtp => wtp._1.replaceAll("\\p{Punct}", "") + 
        "/" + wordnetPos(wtp._2))
      .mkString(" ")
    // for each tense, sum the scores (probabilities) of all matched words
    val scoreTargetPairs = chunkers.map(chunker => {
      val chunking = chunker.chunk(taggedSentence)
      chunking.chunkSet().iterator().toList
        .map(chunk => chunk.score())
        .foldLeft(0.0D)(_ + _)
      })
      .zipWithIndex
      .filter(stp => stp._1 > 0.0D)
    // default to "Present" when nothing matched, else pick the best tense
    if (scoreTargetPairs.isEmpty) "Present"
    else {
      val bestTarget = scoreTargetPairs
        .sortWith((a,b) => a._1 > b._1)
        .head._2
      targets(bestTarget)
    }
  }
  
  def buildChunkers(datafile: File): List[Chunker] = {
    // one dictionary per tense (Past, Present, Future)
    val dicts = ArrayBuffer[MapDictionary[String]]()
    Range(0, targets.size).foreach(i => 
      dicts += new MapDictionary[String]())
    // tracks word/POS pairs already seen, so that only the first
    // (most dominant) synset for each word/POS is captured
    val pwps = scala.collection.mutable.Set[String]()
    Source.fromFile(datafile).getLines()
      .filter(line => (!(line.isEmpty() || line.startsWith("#"))))
      .foreach(line => {
        val cols = line.split("\\s{2,}")
        val wordPos = getWordPos(cols(1))
        // the last four columns hold the past, present, future
        // and atemporal probabilities
        val probs = cols.slice(cols.size - 4, cols.size)
          .map(x => x.toDouble)
        if (! pwps.contains(wordPos)) {
          Range(0, targets.size).foreach(i => 
            dicts(i).addEntry(new DictionaryEntry[String](
              wordPos, targets(i), probs(i))))
        }
        pwps += wordPos
    })
    // wrap each dictionary in a case-insensitive Aho-Corasick chunker
    dicts.map(dict => new ExactDictionaryChunker(
        dict, IndoEuropeanTokenizerFactory.INSTANCE, 
        false, false))
      .toList
  }
  
  // converts a synset identifier such as "early.a.01" into the
  // dictionary key "early/a"; the satellite adjective POS "s" is
  // normalized to "a", and underscore-joined collocations become
  // multiple word/POS tokens
  def getWordPos(synset: String): String = {
    val sscols = synset.split("\\.")
    val words = sscols.slice(0, sscols.size - 2)
    val pos = sscols.slice(sscols.size - 2, sscols.size - 1).head
    words.mkString("")
      .split("_")
      .map(word => word + "/" + (if ("s".equals(pos)) "a" else pos))
      .mkString(" ")
  }  
  
  // maps a Penn Treebank tag (eg "VBD") to the corresponding WordNet
  // POS tag, returning "o" (other) for tags with no WordNet equivalent
  def wordnetPos(ptPos: String): String = {
    val matchIdx = ptPoss.map(p => p.matcher(ptPos).matches())
      .zipWithIndex
      .filter(mip => mip._1)
      .map(mip => mip._2)
    if (matchIdx.isEmpty) "o" else wnPoss(matchIdx.head)
  }
}

As you can see, I just compute the best of Past, Present and Future, and ignore the Atemporal probabilities. I had initially included Atemporal as well, but the accuracy score on the Negex annotated test data came out at 0.89. Changing the logic to look only at Past, and flagging a sentence as Historical if the sum of the past probabilities of its matching words was greater than 0, got me an even worse accuracy of 0.5. Finally, after a bit of trial and error, removing the Atemporal chunker resulted in an accuracy of 0.904741, so that's what I stayed with.

Here is the JUnit test for evaluating the TemporalAnnotator using the annotated list of sentences from Negex. Our default is "Recent"; only when we can confidently say something about the temporality of a sentence do we mark it "Historical". The annotator returns the best of Past, Present and Future, and a prediction of "Past" is translated to "Historical" for comparison against the test annotations; anything else is translated to "Recent".

// Source: src/test/scala/com/mycompany/scalcium/transformers/TemporalAnnotatorTest.scala
package com.mycompany.scalcium.transformers

import java.io.File

import scala.io.Source

import org.junit.Test

class TemporalAnnotatorTest {

  val tann = new TemporalAnnotator(
    new File("src/main/resources/TempoWnL_1.0.txt"))
  val input = new File("src/main/resources/negex/test_input.txt")
  
  @Test
  def evaluate(): Unit = {
    var numTested = 0
    var numCorrect = 0
    Source.fromFile(input).getLines().foreach(line => {
      val cols = line.split("\t")
      val sentence = cols(3)
      val actual = cols(5)
      // skip sentences annotated "Not particular", since the annotator
      // is only evaluated on the Historical vs Recent distinction
      if (! "Not particular".equals(actual)) {
        val predicted = tann.predict(sentence)
        val correct = actual.equals(translate(predicted))
        if (! correct) {
          Console.println("%s|[-] %s|%s"
            .format(sentence, actual, predicted))
        }
        numCorrect += (if (correct) 1 else 0)
        numTested += 1
      }
    })
    Console.println("Accuracy=%8.6f"
      .format(numCorrect.toDouble / numTested.toDouble))
  }
  
  /**
   * Converts predictions made by TemporalAnnotator to
   * predictions that match annotations in our testcase.
   */
  def translate(pred: String): String = {
    pred match {
      case "Past" => "Historical"
      case _ => "Recent"
    }
  }
}

This approach gives us an accuracy of 0.904741, which is not as good as Negex, but the lower accuracy is somewhat offset by its ease of use. You can send the entire sentence to the annotator; there is no need (as there is with Negex) to concept map the sentence beforehand so that it identifies the "important" phrases.
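To illustrate, here is what calling the annotator looks like end to end (a minimal sketch; the file path is the one used in the JUnit test above, and the example sentence and its expected prediction are made up):

// Example usage (illustration only, not part of the project)
import java.io.File
import com.mycompany.scalcium.transformers.TemporalAnnotator

object TemporalAnnotatorExample extends App {
  val tann = new TemporalAnnotator(
    new File("src/main/resources/TempoWnL_1.0.txt"))
  // the raw sentence goes in as-is, no concept mapping needed
  val temporality = tann.predict("She broke her arm two years ago")
  println(temporality)  // expected to come out as "Past" here
}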
