Saturday, June 23, 2007

PyLucene: Python scripting for Lucene

I started learning Python about 3 years ago, and since then I have been trying to adapt it for all my scripting needs. Since I mostly do Java programming, I am not exactly what you would call a hardcore Python programmer. I find myself using Python mostly for database reporting, converting files of data from one format to another, and the like. There have been times in the past when I had to report on a Lucene index, or post-process an existing index to inject special one-off values into an index created by our index building pipeline, but my approach had been to simply write a Java program to do it. Since I dislike running Java programs from the command prompt (mainly because I have to write a shell script that sets the CLASSPATH), I would end up writing a JUnit test to run the code. A lot of work, I know, but that's what I had to work with then.

I had read about PyLucene in the Lucene in Action book, but hadn't had the opportunity to actually download it and take it for a spin. That opportunity came up recently, and I am happy to report that installing and working with PyLucene was relatively painless and quite rewarding. In this post, I explain how I installed PyLucene on my Linux box and show two little scripts that I converted over from Java. From what I have seen, PyLucene has a strong following, but unlike me, these folks actually use PyLucene to build full-fledged applications, not just little one-off scripts. Hopefully, once you see how simple it is, you will be encouraged to use it too, even if you use a language such as Java or C# for mainline development.

PyLucene installation (Fedora Core 4 Linux)

The installation is relatively straightforward, but the instructions are not very explicit. I was trying to install on a box running Fedora Core 4 Linux, and there is no RPM package. Neither is there a package that can be installed by the standard "configure, make, make install" procedure. Seeing no pre-built packages for my distribution, I initially attempted to install from source, but ran into strange prompts that I could not answer, so I tried downloading the Unix binary distribution instead. I ended up copying the files from the binary distribution to my filesystem according to the README file included in this distribution.

sujit@sirocco:~/PyLucene-2.0$ ls
CHANGES  CREDITS  python  README  samples  test
sujit@sirocco:~/PyLucene-2.0$ cd python
sujit@sirocco:~/PyLucene-2.0/python$ ls
PyLucene.py  _PyLucene.so  security
sujit@sirocco:~/PyLucene-2.0/python$ cp -R * /usr/lib/python2.4/site-packages

Basically, I copied all the files under the python subdirectory of the downloaded binary distribution to my Python site-packages directory. That was the end of the installation.
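
A quick sanity check (my own, not something from the README): if the copy worked, the module should be importable from a fresh interpreter started in any directory.

#!/usr/bin/python
# Verify the PyLucene module is importable after copying it into
# site-packages; an ImportError here means the copy did not take.
from PyLucene import IndexSearcher, StandardAnalyzer
print "PyLucene imported OK"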

To test this module, I decided to port the two Java programs I had written to do the simple index reporting and post-processing mentioned earlier. Not only did they end up taking fewer lines of code to write, they are also at the right level of abstraction, since these tasks really deserve to be scripts. I also ended up laying the groundwork for building quick and dirty scripts to access and modify Lucene indexes, just like the ones I have for relational databases.

Script to report on crawled URLs in an Index

The script below opens an index whose directory is supplied on the command line, and writes a pipe-delimited report (currently to stdout) of title and url for each document. This can be useful for testing, since you will then know what kind of search terms to enter for these indexes to return results. It can also be useful for verifying that we crawled the sites we were supposed to crawl.

#!/usr/bin/python
# Takes an index directory from the command line and produces a pipe
# delimited report of title and URL from the index.
import sys
from PyLucene import IndexSearcher, FSDirectory

def usage():
  print " ".join([sys.argv[0], "/path/to/index/to/read"])
  sys.exit(-1)

def main():
  if (len(sys.argv) != 2):
    usage()
  path = sys.argv[1]
  # open the index read-only (the False flag means "do not create")
  store = FSDirectory.getDirectory(path, False)
  searcher = IndexSearcher(store)
  numdocs = searcher.maxDoc()
  print "#-docs:", numdocs
  # Lucene document ids are 0-based
  for i in range(numdocs):
    doc = searcher.doc(i)
    title = doc.get("title") or ""
    url = doc.get("url") or ""
    print "|".join([title.encode('ascii', 'replace'), url])
  searcher.close()

if __name__ == "__main__":
  main()
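
One thing the loop above glosses over: maxDoc() counts deleted documents as well, so on an index with deletions, searcher.doc(i) can fail. A variant that goes through the underlying IndexReader (a sketch along the same lines, not part of the original script) skips the deleted slots:

#!/usr/bin/python
# Variant of the report loop that skips deleted documents by using
# IndexReader directly instead of IndexSearcher.
import sys
from PyLucene import IndexReader

def main():
  reader = IndexReader.open(sys.argv[1])
  for i in range(reader.maxDoc()):
    if reader.isDeleted(i):
      continue
    doc = reader.document(i)
    title = doc.get("title") or ""
    url = doc.get("url") or ""
    print "|".join([title.encode('ascii', 'replace'), url])
  reader.close()

if __name__ == "__main__":
  main()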

Script to inject additional precomputed data

This script takes a pre-built index as input and injects an additional field into some of the records, depending on the URL. This can be useful if your url field is stored but not tokenized: you can post-process the index to match the URLs against one or more patterns and add another facet field which you can then query on. In this case, the facet is set up as Index.UN_TOKENIZED, so application code will have to specify the exact facet value it is looking for (see the query sketch after the script).

#!/usr/bin/python
# Copies the index in the specified source directory to the specified
# target directory, transforming it along the way. In this case, it looks
# at the URL and adds in a facet field.
import sys
from PyLucene import IndexSearcher, IndexWriter, StandardAnalyzer, FSDirectory, Field

def usage():
  print " ".join([sys.argv[0], "/path/to/index/source", "/path/to/index/target"])
  sys.exit(-1)

def main():
  if (len(sys.argv) != 3):
    usage()
  srcPath = sys.argv[1]
  destPath = sys.argv[2]
  srcDir = FSDirectory.getDirectory(srcPath, False)
  # the True flag creates the target index, clobbering anything there
  destDir = FSDirectory.getDirectory(destPath, True)
  analyzer = StandardAnalyzer()
  searcher = IndexSearcher(srcDir)
  writer = IndexWriter(destDir, analyzer, True)
  numdocs = searcher.maxDoc()
  # Lucene document ids are 0-based
  for i in range(numdocs):
    doc = searcher.doc(i)
    url = doc.get("url")
    if (url is not None and url.find("pattern1") > -1):
      # stored and untokenized, so queries must supply the exact value
      doc.add(Field("facet", "pattern1", Field.Store.YES, Field.Index.UN_TOKENIZED))
    writer.addDocument(doc)
  searcher.close()
  writer.optimize()
  writer.close()

if __name__ == "__main__":
  main()
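
Since the facet field is indexed UN_TOKENIZED, it never goes through an analyzer, so application code has to query it with a TermQuery carrying the exact stored value. A minimal sketch (the index path and field values here are just carried over from the script above):

#!/usr/bin/python
# Query the facet field written by the script above. UN_TOKENIZED fields
# must be matched with the exact term, so we use a TermQuery directly.
from PyLucene import IndexSearcher, FSDirectory, TermQuery, Term

searcher = IndexSearcher(FSDirectory.getDirectory("/path/to/index/target", False))
hits = searcher.search(TermQuery(Term("facet", "pattern1")))
print "#-hits:", hits.length()
searcher.close()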

In both cases, the code should look familiar if you have worked with Lucene before. These are really the same Java classes wrapped to be accessible from Python, so the only difference is the more compact Pythonic syntax. The one caveat is that PyLucene uses Lucene 1.4, whereas most Lucene shops are probably on 2.0, or 2.1 if you want to be on the bleeding edge. However, for one-off scripts, the version difference should not matter most of the time, unless you are trying to use one of the newer features in your Python code.

Adding your own Analyzer to Luke

On a somewhat related note, I was able to add Analyzers to my Luke application. I know support exists for this, and most Lucene programmers probably know how to do this already, but since there are no clear instructions on how to do it, I figured I'd write it up here. It's not hard once you know how. The standard shell script invocation for Luke is:

#!/bin/bash
java -jar $HOME/bin/lukeall-0.7.jar

I was experimenting with the Lucene based spell checker described in the Java.net article "Did You Mean: Lucene?", and I wanted to use its SubwordAnalyzer within Luke. Luke comes with a pretty comprehensive set of Analyzer implementations, but this one is not among them. So I changed the script above to include the jar file containing this class, along with its dependencies (such as commons-lang, commons-io, etc), and changed the java invocation to use -cp instead of -jar. Here is my new script to call Luke:

#!/bin/bash
M2_REPO=$HOME/.m2/repository
export CLASSPATH=$HOME/projects/spellcheck/target/spellcheck-1.0-SNAPSHOT.jar:\
  $M2_REPO/log4j/log4j/1.2.12/log4j-1.2.12.jar:\
  $M2_REPO/commons-io/commons-io/1.2/commons-io-1.2.jar:\
  $M2_REPO/commons-lang/commons-lang/2.2/commons-lang-2.2.jar
java -cp $HOME/bin/lukeall-0.7.jar:$CLASSPATH org.getopt.luke.Luke

And now I can use the SubwordAnalyzer from within Luke to query an index that was built with this analyzer from a list of English words.

2 comments (moderated to prevent spam):

Bob said...

I've inherited an old Lucene index and I want to do some ML on it using Mahout, but the index wasn't originally created with TermVector set to "YES" on the relevant fields.

Do you think it's possible to script via PyLucene and retroactively add term vectors to each document?

Sujit Pal said...

Probably not directly. The problem with Lucene is that it's a WORM (Write Once Read Many) datastore. I am guessing you are interested in setting term vectors on the content field, and generally this field is created unstored, so only the terms are kept. However, if all the fields you care about are stored, you can copy the data over to a new index and set term vectors on that index.
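
Something along these lines should work (a hedged sketch, untested; it assumes the source index stores both a "title" and a "content" field, and a PyLucene recent enough to expose Field.TermVector):

#!/usr/bin/python
# Sketch: rebuild an index from stored fields only, turning term
# vectors on for the content field. The "title" and "content" field
# names are assumptions; substitute your own stored fields.
from PyLucene import IndexReader, IndexWriter, StandardAnalyzer, Document, Field

reader = IndexReader.open("/path/to/old/index")
writer = IndexWriter("/path/to/new/index", StandardAnalyzer(), True)
for i in range(reader.maxDoc()):
  if reader.isDeleted(i):
    continue
  src = reader.document(i)
  doc = Document()
  doc.add(Field("title", src.get("title") or "",
                Field.Store.YES, Field.Index.TOKENIZED))
  # re-analyze the stored content, this time storing term vectors
  doc.add(Field("content", src.get("content") or "",
                Field.Store.YES, Field.Index.TOKENIZED,
                Field.TermVector.YES))
  writer.addDocument(doc)
reader.close()
writer.optimize()
writer.close()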