Friday, September 05, 2014

Tokenizing and Named Entity Recognition with Stanford CoreNLP



I got into NLP using Java, but I was already using Python at the time, and soon came across the Natural Language Toolkit (NLTK), and just fell in love with the elegance of its API. So much so that when I started working with Scala, I figured it would be a good idea to build an NLP toolkit with an API similar to NLTK's, primarily as a way to learn NLP and Scala, but also to build something that would be as enjoyable to work with as NLTK while having the benefit of Java's rich ecosystem.

The project is perennially under construction, and serves as a test bed for my NLP experiments. In the past, I have used OpenNLP and LingPipe to build Tokenizer implementations that expose an API similar to NLTK's. More recently, I have built a Named Entity Recognizer (NER) with OpenNLP's NameFinder. At the recommendation of one of my readers, I decided to take a look at Stanford CoreNLP, with which I ended up building a Tokenizer and an NER implementation. This post describes that work.

The appropriate Tokenizer implementation is instantiated by calling Tokenizer.getTokenizer("stanford") using a factory pattern on the Tokenizer trait. The trait itself is not shown in this post; a rough sketch of what it might look like, with the method names inferred from the calls used below, is as follows:
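package com.mycompany.scalcium.tokenizers

trait Tokenizer {
  def sentTokenize(para: String): List[String]
  def wordTokenize(sentence: String): List[String]
  def posTag(sentence: String): List[(String,String)]
  def phraseChunk(sentence: String): List[(String,String)]
  // assumption: phraseTokenize has a default implementation that keeps
  // just the phrase text from phraseChunk
  def phraseTokenize(sentence: String): List[String] =
    phraseChunk(sentence).map(chunk => chunk._1)
}

object Tokenizer {
  def getTokenizer(name: String): Tokenizer = name match {
    case "stanford" => new StanfordTokenizer()
    // other backends (OpenNLP, LingPipe) would be matched here too
    case _ => throw new IllegalArgumentException("Unknown tokenizer: " + name)
  }
}

The code for the Stanford-backed Tokenizer is shown below.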

// Source: src/main/scala/com/mycompany/scalcium/tokenizers/StanfordTokenizer.scala
package com.mycompany.scalcium.tokenizers

import java.util.Properties

import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer

import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import edu.stanford.nlp.trees.Tree
import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation

class StanfordTokenizer extends Tokenizer {

  val props = new Properties()
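  // the JavaConversions implicits let us assign to the Properties object
  // below with Scala Map syntax; the parse annotator is needed only for
  // phraseChunk()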
  props("annotators") = "tokenize, ssplit, pos, parse"
  val pipeline = new StanfordCoreNLP(props)

  override def sentTokenize(para: String): List[String] = {
    val doc = new Annotation(para)
    pipeline.annotate(doc)
    doc.get(classOf[SentencesAnnotation])
      .map(coremap => coremap.get(classOf[TextAnnotation]))
      .toList
  }
  
  override def wordTokenize(sentence: String): List[String] = {
    val sent = new Annotation(sentence)
    pipeline.annotate(sent)
    sent.get(classOf[SentencesAnnotation])
      .head
      .get(classOf[TokensAnnotation])
      .map(corelabel => corelabel.get(classOf[TextAnnotation]))
      .toList
  }
  
  override def posTag(sentence: String): List[(String,String)] = {
    val sent = new Annotation(sentence)
    pipeline.annotate(sent)
    sent.get(classOf[SentencesAnnotation])
      .head
      .get(classOf[TokensAnnotation])
      .map(corelabel => {
        val word = corelabel.get(classOf[TextAnnotation])
        val tag = corelabel.get(classOf[PartOfSpeechAnnotation])
        (word, tag)
      })
      .toList
  }
  
  override def phraseChunk(sentence: String): List[(String,String)] = {
    val sent = new Annotation(sentence)
    pipeline.annotate(sent)
    val tree = sent.get(classOf[SentencesAnnotation])
      .head
      .get(classOf[TreeAnnotation])
    val chunks = ArrayBuffer[(String,String)]()
    extractChunks(tree, chunks)
    chunks.toList
  }
  
  def extractChunks(tree: Tree, chunks: ArrayBuffer[(String,String)]): Unit = {
    tree.children().foreach(child => {
      val tag = child.value()
      if (child.isPhrasal() && hasOnlyLeaves(child)) {
        // concatenate words into a phrase if the children of this
        // phrase are all leaves (not phrases themselves)
        val phrase = child.getLeaves[Tree]()
          .flatMap(leaf => leaf.yieldWords())
          .map(word => word.word())
          .mkString(" ")
        chunks += ((phrase, tag))
      } else {
        // dig deeper
        extractChunks(child, chunks)
      }
    })
  }
  
  def hasOnlyLeaves(tree: Tree): Boolean =
    !tree.children().exists(child => child.isPhrasal())
}

Most of the calls are straightforward. The only exception is the phraseChunk() method, which was originally built as a wrapper around OpenNLP's shallow phrase chunker. The Stanford parser only does a deep parse into a Tree, so my code extracts the list of phrases and phrase types from the parse tree instead.
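To illustrate, here is roughly what the top of the parse tree looks like for the first example sentence below (bracketing abbreviated by hand for this post, not actual parser output). extractChunks() recurses down from the root and emits a chunk whenever it reaches a phrasal node whose children are all non-phrasal:

(NP
  (NP (NNP Pierre) (NNP Vinken))       <- emitted as (Pierre Vinken, NP)
  (, ,)
  (ADJP
    (NP (CD 61) (NNS years))           <- emitted as (61 years, NP)
    (JJ old))
  (, ,))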

>>> text = """
Pierre Vinken, 61 years old, will join the board as a nonexecutive director 
Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group
based at Amsterdam. 
Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields 
PLC, was named a nonexecutive director of this British industrial conglomerate.
"""
>>> tokenizer = Tokenizer.getTokenizer("stanford")
>>> sentences = tokenizer.sentTokenize(text)
List(Pierre Vinken, 61 years old, will join the board as a nonexecutive 
  director Nov. 29.,
  Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group
  based at Amsterdam.,
  Rudolph Agnew, 55 years old and former chairman of Consolidated Gold 
  Fields PLC, was named a nonexecutive director of this British industrial 
  conglomerate.)
>>> words = tokenizer.wordTokenize(sentences(0))
List(Pierre, Vinken, ,, 61, years, old, ,, will, join, the, board, as,
  a, nonexecutive, director, Nov., 29, .)
>>> postags = tokenizer.posTag(sentences(0))
List((Pierre,NNP), (Vinken,NNP), (,,,), (61,CD), (years,NNS), (old,JJ),
  (,,,), (will,MD), (join,VB), (the,DT), (board,NN), (as,IN), (a,DT),
  (nonexecutive,JJ), (director,NN), (Nov.,NNP), (29,CD), (.,.))
>>> phrases = tokenizer.phraseTokenize(sentences(0))
List(Pierre Vinken, 61 years, the board, a nonexecutive director, Nov. 29)
>>> chunks = tokenizer.phraseChunk(sentences(0))
List((Pierre Vinken,NP), (61 years,NP), (the board,NP), 
  (a nonexecutive director,NP), (Nov. 29,NP-TMP))
>>>

The Stanford CoreNLP-based NER follows an approach similar to the Tokenizer's, and is instantiated by calling NameFinder.getNameFinder("stanford") using a factory pattern on the NameFinder trait. As before, the trait itself is not shown in this post; a rough sketch, inferred from the calls used here, might look like this:
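package com.mycompany.scalcium.names

trait NameFinder {
  // for each input sentence, return a list of
  // (entityType, beginOffset, endOffset) triples
  def find(sentences: List[String]): List[List[(String,Int,Int)]]
}

object NameFinder {
  def getNameFinder(name: String): NameFinder = name match {
    case "stanford" => new StanfordNameFinder()
    case _ => throw new IllegalArgumentException("Unknown name finder: " + name)
  }
}

Here is the code for the Stanford implementation.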

// Source: src/main/scala/com/mycompany/scalcium/names/StanfordNameFinder.scala
package com.mycompany.scalcium.names

import java.util.Properties

import scala.collection.JavaConversions._

import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP

class StanfordNameFinder extends NameFinder {

  val props = new Properties()
  props("annotators") = "tokenize, ssplit, pos, lemma, ner"
  props("ssplit.isOneSentence") = "true"
  val pipeline = new StanfordCoreNLP(props)

  override def find(sentences: List[String]): List[List[(String,Int,Int)]] = {
    sentences.map(sentence => {
      val sent = new Annotation(sentence)
      pipeline.annotate(sent)
      sent.get(classOf[SentencesAnnotation])
        .head
        .get(classOf[TokensAnnotation])
        .map(corelabel => (corelabel.ner(), corelabel.beginPosition(), 
          corelabel.endPosition()))
        .filter(triple => (! "O".equals(triple._1)))
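        // NOTE: grouping by entity type merges all mentions of the same
        // type within a sentence into a single (type, begin, end) span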
        .groupBy(triple => triple._1)
        .map(kv => {
          val (key, list) = kv
          val begin = list.map(x => x._2).min
          val end = list.map(x => x._3).max
          (key, begin, end)
        })
        .toList
    })
  }
}

The previous version of my Stanford-based NER used the Stanford NER library and the 4-class classifier model directly. That was definitely an improvement over the OpenNLP NameFinder described here (scroll down to the end). The code above creates an NER that can recognize 7 classes, using code very similar to the Tokenizer's, although arguably I could have created a 7-class NER earlier simply by loading the appropriate classifier model. For instance, one could presumably pin the pipeline to a specific model via the ner.model property; the sketch below is untested:
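val props = new Properties()
props("annotators") = "tokenize, ssplit, pos, lemma, ner"
// assumption: this is the 7-class (MUC) CRF model from the Stanford NER
// distribution; I have not verified this path against this version
props("ner.model") = "edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz"
val pipeline = new StanfordCoreNLP(props)

Here is some output from the NER.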

>>> namefinder = NameFinder.getNameFinder("stanford")
>>> entities = namefinder.find(sentences)
List(List((PERSON,0,13), (DURATION,15,27), (DATE,76,83)),
  List((PERSON,4,10), (MISC,45,50), (LOCATION,77,86), (ORGANIZATION,26,39)),
  List((PERSON,0,13), (MISC,111,118), (DURATION,16,28), (ORGANIZATION,52,80)))
>>> prettyPrint(sentences, entities)
Pierre Vinken, 61 years old, will join the board as a nonexecutive director 
Nov. 29.
  (0,13): Pierre Vinken / PERSON
  (15,27): 61 years old / DURATION
  (76,83): Nov. 29 / DATE
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group based 
at Amsterdam.
  (4,10): Vinken / PERSON
  (45,50): Dutch / MISC
  (77,86): Amsterdam / LOCATION
  (26,39): Elsevier N.V. / ORGANIZATION
Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields 
PLC, was named a director of this British industrial conglomerate.
  (0,13): Rudolph Agnew / PERSON
  (111,118): British / MISC
  (16,28): 55 years old / DURATION
  (52,80): Consolidated Gold Fields PLC / ORGANIZATION
>>>

When I last looked at the Stanford parser, I found the API somewhat hard to understand. The CoreNLP API is much simpler and feels more unified, possibly at the cost of some compile-time type checking.

Overall, I was quite impressed by Stanford CoreNLP's accuracy. Performance-wise, however, Stanford CoreNLP seems to be uniformly slower than either OpenNLP or LingPipe, although not by much (based on my limited set of examples). In all fairness, though, Stanford CoreNLP is designed to work in batch mode: you run the pipeline over the text once and then walk through the annotations generated by the annotators named in the Properties object passed into the constructor.
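A more batch-friendly usage might look like the sketch below (reusing the imports and pipeline from the NameFinder above, with text being the paragraph from the earlier example), annotating the whole text once instead of once per sentence:

// annotate the entire document in a single call ...
val doc = new Annotation(text)
pipeline.annotate(doc)
// ... then walk the per-sentence and per-token annotations it produced
doc.get(classOf[SentencesAnnotation]).foreach(sentence => {
  sentence.get(classOf[TokensAnnotation]).foreach(token => {
    println(token.word() + "/" + token.ner())
  })
})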

10 comments (moderated to prevent spam):

Mihai said...

Hi Sujit,
You might be interested in this project:
https://github.com/sistanlp/processors

It provides a simple API for CoreNLP and other NLP and ML tools, including some developed in house.
Best wishes,
Mihai

Sujit Pal said...

Thanks for the pointer, Mihai; the project looks quite interesting. I think we both started with the same objectives, but yours is built much better and has reached farther. Definitely a lot to learn from your project, thanks again.

Unknown said...

Hi Sujit: Thanks, I am new to both Scala and NLP, and this gave me a great head start. I have a related question: if I have an annotated (biomedical) corpus that I could use to train my model, how could I make use of Stanford CoreNLP? I know the NER tagger only works on specific types of entities.

Sujit Pal said...

Hi Unknown, glad it helped you. I've never tried training with CoreNLP myself, but this CoreNLP FAQ page has some information that may be helpful.

Anonymous said...

Hi Sujit,
would you mind sharing your Tokenizer and NameFinder traits as well?
Thanks,
Tim Malt

Sujit Pal said...

Sure, here they are on GitHub: source code for Tokenizer and NameFinder.

Anonymous said...

Sir, can you give source code to find location, date, and person? Is it possible to do in C#/ASP.NET? If you have it, please share.

Sujit Pal said...

Hi Sheela, sorry, I don't have code in C#/ASP.NET for this, but there is also a Stanford CoreNLP library for C# (I just found this out). I am guessing that the approach to finding named entities would be similar in both libraries (and Java and C# are also quite similar), so maybe you could use my code as an example of how to do this in C#.

Madhav said...

Hi Sujit, I am trying to use Stanford CoreNLP with Scala and ended up at your blog.
One thing I didn't understand is how you are able to use Scala collection functions (like head, map, etc.),
since Stanford CoreNLP is written in Java and returns Java collections.
Am I missing anything here?

Sujit Pal said...

I am using the JavaConversions implicits. They allow you to interoperate between Java and Scala collections (most of the time). You can import the implicits using import scala.collection.JavaConversions._.
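For example, here is a minimal illustration (not from the post's code):

import scala.collection.JavaConversions._

val jlist = new java.util.ArrayList[String]()
jlist.add("hello")
jlist.add("world")
// the implicit conversion lets Scala collection methods such as
// map and head work directly on the java.util.List
val upper = jlist.map(w => w.toUpperCase).toList  // List(HELLO, WORLD)
println(upper.head)                               // HELLO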