Saturday, December 22, 2007

Django : First Impressions

In my last post, I spoke of my experience with Pylons, a Python rapid web application development framework similar to Ruby on Rails, and drew parallels with popular Java frameworks I have used in my development work. An anonymous commenter expressed surprise at one of my comments where I said I considered TurboGears and Pylons top contenders in this space, and pointed out that Django has more people on its mailing lists and a longer list of deployed sites in production.

First off, let me say that I am no expert in Python frameworks. I happen to like the Python language, and use it for all my scripting work. I use Java for web development, and I picked up PHP again when I built a small application recently. However, I have admired Ruby on Rails for a while, but was too lazy to learn the Ruby language, so I was really looking for something similar in Python-land.

So anyway, when I looked at Django the last time around, I thought it looked (too) heavily optimized for developing database-driven web applications. Also, both TurboGears and Pylons leverage external projects for part of their functionality, so as a Java programmer, that approach appeared more attractive to me. However, I realized that I may have been hasty in my comment, so I decided to download Django and run through the tutorial (which builds a database-driven app, but at least I would know more about it than I did when I wrote the comment).

The installation was completely painless. I downloaded the tarball for the latest official release (0.96.1) from the site, then ran

python setup.py install
to install it. I then followed a very informative four-page tutorial to build a poll system with a MySQL backend. The result includes a very pretty and functional admin interface, along with the user-facing application itself. I won't bore you with the details, except to say that the tutorials are quite extensive and very detailed, and all the commands work as advertised.

Running

django-admin.py startproject ${project_name}
creates an empty project template for you. The template is sparser than a Rails project; it consists of just 4 files in the project directory. The manage.py script is the super-script (similar to the Rails scripts in the scripts directory) that you will use to generate and run almost everything; it delegates to django-admin.py in Django's installation.

To run the application for development, Django provides a runserver subcommand for manage.py (

python manage.py runserver
) which starts a development HTTP server on port 8000.

Application-wide configuration is done through a single settings.py file. Sub-applications within the project are stored as subdirectories, which are accessed from Django code as Python packages, and are wired into the project through the settings.py file. To create a package, run the command

python manage.py startapp ${packagename}
. This creates a subdirectory with skeleton code for models.py and views.py.

Django follows the MVC paradigm; however, the nomenclature used is somewhat different. Database tables are mapped directly to models in models.py, and the controller methods are declared in views.py. The actual views live in a template directory that is defined in settings.py. The templates contain HTML plus markup in Django's custom templating language. So MVC is actually MTV (Model-Template-View) in Django.

Django's custom ORM is quite lean and similar to SQLAlchemy in Pylons; like RoR, it has support for declaring and following 1:1, 1:n, and n:n relationships. Each database table is modelled as a class in the models.py file in the package directory. You set up the database connection in the settings.py file, then create the model classes. Then you run

python manage.py syncdb
to create the tables in the database. However, I found the SQLAlchemy API more intuitive.

I particularly liked the Django shell, in which it is possible to interactively play with the models, adding and removing data, to see if everything works. It can be started with

python manage.py shell
at the command line.

The templating language is custom to Django. It contains most of the simple control structures you would expect. However, the impression I got is that the preferred way is to do as much processing as possible in the controller layer (methods in views.py), which is what I would prefer to do in any case.
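To give a flavor of the variable-substitution half of such a templating language, here is a hand-rolled toy renderer (this is not Django's actual template engine, just an illustration of the idea; the names are mine):

```python
import re

def render(template, context):
    """Replace {{ var }} markers with values looked up in context (toy example)."""
    def repl(match):
        # unknown variables render as the empty string
        return str(context.get(match.group(1), ""))
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", repl, template)

print(render("Poll: {{ question }} ({{ votes }} votes)",
             {"question": "Favorite color", "votes": 42}))
# -> Poll: Favorite color (42 votes)
```

A real template engine adds control structures, filters, and escaping on top of this basic substitution step.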

Overall, I liked Django. It is very similar to RoR, probably more so than TurboGears and Pylons, in the sense that it is a custom full-stack application optimized for database-driven apps. Django's routing appears to be slightly more advanced since regular expressions can be used to specify patterns, but clean URLs can be constructed by both frameworks. Pylons generates a nicer directory structure and uses (in my mind at least) correct nomenclature; however, that is probably just a function of usage.
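The regex-based routing idea can be sketched in a few lines of plain Python (a hand-rolled toy dispatcher, not Django's or Routes' actual API; the patterns and handler names are made up):

```python
import re

# Hypothetical handlers for a poll application
def detail(poll_id):
    return "poll %s" % poll_id

def index():
    return "poll index"

# URL patterns are (regex, handler) pairs, checked in order;
# captured groups become handler arguments
urlpatterns = [
    (r"^polls/(\d+)/$", detail),
    (r"^polls/$", index),
]

def dispatch(path):
    """Invoke the first handler whose pattern matches the path."""
    for pattern, handler in urlpatterns:
        m = re.match(pattern, path)
        if m:
            return handler(*m.groups())
    return "404"

print(dispatch("polls/5/"))  # -> poll 5
```

The appeal is that URLs stay clean while the mapping remains arbitrarily flexible, since anything expressible as a regex can be routed.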

From browsing the net, it seems that there is a stronger following for Django than for Pylons. Comparisons between TurboGears (and by extension, Pylons, which is very similar) and Django seem to favor the latter. However, there seems to be a dedicated following for Pylons as well. I found both Pylons and Django to be equally effective for my purposes, although I found Pylons to be more Java like, and thus more intuitive. Both offer functionality similar to RoR, which is great news for Ruby-challenged Python programmers like me.

And to the anonymous commenter, thanks very much for pointing me to Django.

Monday, December 17, 2007

First Steps with Pylons

I have written previously about how impressed I was with Ruby on Rails (RoR) when I first came across it about a year or so ago. For me, it represented a shift in the way web applications were designed - rather than build an individual application from scratch, the framework generates a template which can be extended as needed for the given application. Obviously, convention over configuration is an important aspect of this approach, since the template code will expect to find files with the "right" name and in the "right" location.

However, I did not know the Ruby scripting language, and in spite of my best efforts, I haven't been able to muster up enough interest to take the time to learn it. For one, Maven 2.x offers similar (although not as complete) functionality to build application archetypes, and since I mostly do Java web application development for a living, it seemed a better investment of my time to work with Maven2 and Java. Second, although RoR seems to have impressive mindshare among the web literati, my personal experience is that customers expect mid- to large-scale apps to be in Java. They usually don't have an opinion about smaller (3-10 page) apps, but I find PHP far simpler, more natural, and more than up to the task of building one of these. So there is not enough incentive for me to learn a language in order to use a framework that I will hardly ever use.

However, PHP uses the "code-in-page" approach, which is not easy to maintain once the application grows beyond a certain size. What I needed was a MVC framework based on a scripting language I already knew and enjoyed working in. Happily, when looking recently, I found not one, but three popular open source MVC frameworks based on Python - Django, TurboGears and Pylons.

Of these, TurboGears and Pylons are more glue-like, assembling the framework from a variety of other open source sub-frameworks. It is also possible to switch out some of the sub-frameworks and replace them with alternatives if one is a better fit for the application or the programming environment.

I started playing with TurboGears, since it is the more mature (and, I hoped, more stable) framework of the two. Unfortunately, I could not get tg-admin to build my application template, because there were some module mismatches with Python 2.5. After spending an afternoon trying to debug the problem, I gave up and tried Pylons. Installation was a breeze, partly because I already had some of the components installed for TurboGears, but there were absolutely no surprises, and everything worked as expected. I was able to get through the Getting Started guide (a Hello World style web application) and the QuickWiki tutorial within the next couple of days.

As I mentioned above, Pylons is composed of a number of sub-frameworks, each of which is an independent project in its own right. The Paster component provides commands to build an application template, similar to Maven2's archetype:create command, and to deploy the application as a Python egg (similar to a WAR file in a Java web application). It also provides a built-in HTTP server for testing the application, similar to Maven2's Jetty plugin.

For database access, Pylons uses the SQLAlchemy project, which provides a Hibernate like API to access database tables and rows as Python objects. These are called model objects in Pylons. Each table (or logical group of tables) is modelled as a Python class. The SQLAlchemy engine is injected into the environment using a Python call in environment.py. The engine exposes a Session object, similar to Hibernate, from which a Query object is extracted. Method calls on the Query object translate to SQL queries to the database. However, the caller only sees application objects. This approach works great when the application gets to decide the database schema, but would probably be slightly more difficult (although not impossible) when building an application against an existing database. In this respect, it has the same limitations as Hibernate. I still need to delve deeper into this though.
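The flavor of this mapping (tables as classes, rows as objects) can be sketched with nothing but the standard library; note this is a toy stand-in to illustrate the idea, not SQLAlchemy's API, and the table and class names are made up:

```python
import sqlite3

class Page:
    """Maps a row of a hypothetical 'pages' table to a Python object."""
    def __init__(self, id, title, content):
        self.id, self.title, self.content = id, title, content

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT, content TEXT)")
conn.execute("INSERT INTO pages (title, content) VALUES (?, ?)", ("Home", "Welcome"))

def find_by_title(title):
    """A query method: SQL goes in, application objects come out."""
    row = conn.execute(
        "SELECT id, title, content FROM pages WHERE title = ?", (title,)).fetchone()
    return Page(*row) if row else None

page = find_by_title("Home")
print(page.content)  # -> Welcome; the caller never sees raw rows
```

SQLAlchemy automates exactly this translation, so the caller works with sessions, queries, and mapped classes instead of hand-written SQL.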

For the page rendering subsystem, Pylons uses the Mako templating framework. Mako templates contain markup that seems to be a superset of JSTL-style tags plus something like Velocity-style tags. The JSTL-style tags work off predefined model objects which are injected by Pylons controllers. Personally, I think it's a bit confusing to have two kinds of markup, but I guess it may just have evolved that way. Although I am still new to Pylons, I think this would be one of the candidate subsystems where I might be tempted to look for alternatives.

The controller subsystem is provided by Pylons itself, and it is all plain generated Python code. There is a provided BaseController class which all application controllers must extend. The structure of an application controller is similar in style to a Spring MultiActionController, with each public method mapped to a URL pattern and representing a user action.

To route user URLs to the appropriate controller method, Pylons uses the Routes project. The default routes are set up in an in-memory Python data structure. Additional application-specific mappings are added to this structure through additional map.connect() calls in routing.py. This approach appears to be more powerful than Spring XML configuration, in that it allows regular expressions as well. TurboGears uses the @expose decorator on controller methods to do this, which may or may not be better depending on your point of view.

Thus, for a Java programmer who has worked with Maven2, Spring and Hibernate, getting started with Pylons is likely to be quite easy, probably more so than RoR. Of course, it would help if you already knew some Python. I think the relative popularity of RoR among Java programmers is because a lot of them would list Perl as their favorite scripting language, and Perl to Ruby is not that much of a stretch. Perl used to be my favorite scripting language at one point, but ever since I began to use Python, reading and writing Perl code is less fun than it used to be. Python is very Java-like in a lot of respects, although it is much more concise.

I think that another reason for the relative obscurity of Python based MVC frameworks is the availability of too many choices. Given that Python's philosophy is that "there should be one — and preferably only one — obvious way to do it", I can see how multiple frameworks could be anathema to Python programmers. However, jokes apart, this does dilute the focus of any serious efforts at popularizing a framework. Unlike RoR, there is no one "right" way to do rapid MVC development in Python. However, this is changing, and the two top contenders (in my opinion) in this space are TurboGears and Pylons. The two are very similar, although each glues together different sets of sub-projects. As Ian Bicking states:

I'd have a hard time describing the technical differences (between TurboGears and Pylons) in a meaningful way to someone who didn't already know something about Python web development.
However, TurboGears 2.x will be based on Pylons, so soon Python web developers will have one uber MVC framework to work with, possibly composed of the best sub-projects from either framework. Of course, in keeping with the "glue" philosophy, I expect that the programmer would be able to switch a component out with another one in the same category if required.

Wednesday, December 05, 2007

JMS Patterns with ActiveMQ

Quite some time ago, before I started this blog, I used to have a Geocities (now Yahoo!) homepage where I would write articles about stuff I did, much like this blog here. One such article, "Java Patterns for MQ Series Clients", described various types of clients that could be written against an MQ-Series server. Unlike most of the other stuff I wrote there, it proved to be somewhat popular; a couple of people actually liked it enough to send me email about it.

I was working on a Java/MQ-Series project at the time, and I wrote it as part of learning the MQ-Series Java API. Although the JMS specification was out, and a JMS wrapper around the MQ-Series API was also available, it was still quite new and relatively untested, so a decision was made to code directly against the MQ-Series Java API (which itself was a wrapper over the MQ-Series C API) instead. Since then, no JMS project has come my way, so, even though I had read about the JMS API and was familiar with the concepts (having worked on its predecessor product), I had never worked with JMS. Since I mostly learn by doing, I figured that it would be instructive to try and code up the same scenarios described in my old article using JMS. This blog post is a result of that effort.

For the server, I chose Apache ActiveMQ, a free and open source JMS implementation. Setting it up was quite simple. All I had to do was download the latest stable binary distribution (4.1.1 in my case), untar it into a directory, then run its startup script to bring up a TCP listener on port 61616.

sujit@sirocco:~/tmp$ tar xvzf apache-activemq-4.1.1.tar.gz
sujit@sirocco:~/tmp$ cd apache-activemq-4.1.1
sujit@sirocco:~/tmp/apache-activemq-4.1.1$ bin/activemq

My basic setup is as follows: there is a server which loops forever, reading messages off a request queue, doing some processing, and writing responses to a response queue. Clients communicate with the server by writing messages to the request queue, and optionally reading messages off the response queue. The point-to-point JMS style is used.

The code for the server is shown below. It reads a (text) message from the request queue, prepends "Processed" to it, and places it on the response queue. The init() method sets up the queues, and the destroy() method tears down the queues.

// MessageServer.java
package com.mycompany.mqpatterns;

import javax.jms.Connection;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.MessageListener;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.jms.Queue;

import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class MessageServer implements MessageListener {

  private final Log log = LogFactory.getLog(getClass());
  
  private Connection connection;
  private Session session;
  private MessageConsumer consumer;
  private MessageProducer producer;
  
  public void init() throws Exception {
    // set 'er up
    ActiveMQConnectionFactory connectionFactory = 
      new ActiveMQConnectionFactory("tcp://localhost:61616");
    connection = connectionFactory.createConnection();
    session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
    // create our request and response queues (point-to-point, so Queue not Topic)
    Queue request = session.createQueue("request.queue");
    Queue response = session.createQueue("response.queue");
    // and attach a consumer and producer to them
    consumer = session.createConsumer(request);
    consumer.setMessageListener(this);
    producer = session.createProducer(response);
    // start your engines...
    connection.start();
  }
  
  public void destroy() throws Exception {
    session.close();
    connection.close();
  }

  public void onMessage(Message message)  {
    try {
      if (message instanceof TextMessage) {
        String messageText = ((TextMessage) message).getText();
        log.debug("Server: Got request [" + messageText + "]");
        Message responseMessage = 
          session.createTextMessage("Processed " + messageText);
        if (message.getJMSCorrelationID() != null) {
          // pass it through
          responseMessage.setJMSCorrelationID(message.getJMSCorrelationID());
        }
        producer.send(responseMessage);
      }
    } catch (JMSException e) {
      log.error(e);
    }
  }
}

On re-reading my original article, I found that it actually covered only 3 basic patterns, not 5, so I wrote implementations for these 3 patterns, which are described below. A lot of the code in the init() and destroy() methods is boilerplate, but it just serves to set up and tear down the queues.

  • Fire And Forget
  • Pseudo Synchronous
  • Asynchronous with Callback

Fire And Forget

A real-life analog for this would be writing a letter and dropping it in the mail. You (the client) do not really know when, or if, it will reach its destination, and you do not expect a response or an acknowledgement from the receiver. The code for this is shown below:

// FireAndForgetClient.java
package com.mycompany.mqpatterns;

import javax.jms.Connection;
import javax.jms.DeliveryMode;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.Queue;

import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class FireAndForgetClient {

  private final Log log = LogFactory.getLog(getClass());
  
  private Connection connection;
  private Session session;
  private MessageProducer producer;
  
  public void init() throws Exception {
    // set 'er up
    ActiveMQConnectionFactory connectionFactory = 
      new ActiveMQConnectionFactory("tcp://localhost:61616");
    connection = connectionFactory.createConnection();
    session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
    // create our request queue (this client never reads responses)
    Queue request = session.createQueue("request.queue");
    // and attach a producer to it
    producer = session.createProducer(request);
    producer.setDeliveryMode(DeliveryMode.NON_PERSISTENT);
    // and start your engines...
    connection.start();
  }

  public void destroy() throws Exception {
    session.close();
    connection.close();
  }

  public void sendMessage(String messageText) throws Exception {
    producer.send(session.createTextMessage(messageText));
  }
}

Pseudo Synchronous

For all practical purposes, this client is synchronous. I guess the "pseudo" is only there to emphasise the fact that it is built on a medium that is inherently asynchronous. A real-life analog to this would be a (half-duplex) telephone conversation.

// PseudoSynchronousClient.java
package com.mycompany.mqpatterns;

import javax.jms.Connection;
import javax.jms.DeliveryMode;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.jms.Queue;

import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class PseudoSynchronousClient {

  private final Log log = LogFactory.getLog(getClass());
  
  private Connection connection;
  private Session session;
  private MessageProducer producer;
  private MessageConsumer consumer;

  public void init() throws Exception {
    // set 'er up
    ActiveMQConnectionFactory connectionFactory = 
      new ActiveMQConnectionFactory("tcp://localhost:61616");
    connection = connectionFactory.createConnection();
    session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
    // create our request and response queues (point-to-point, so Queue not Topic)
    Queue request = session.createQueue("request.queue");
    Queue response = session.createQueue("response.queue");
    // and attach a consumer and producer to them
    producer = session.createProducer(request);
    producer.setDeliveryMode(DeliveryMode.NON_PERSISTENT);
    consumer = session.createConsumer(response);
    // and start your engines...
    connection.start();
  }

  public void destroy() throws Exception {
    session.close();
    connection.close();
  }
  
  public String sendMessage(String messageText) throws Exception {
    try {
      log.info("Client: Send request [" + messageText + "]");
      producer.send(session.createTextMessage(messageText));
      // block until the response arrives
      Message responseMessage = consumer.receive();
      String responseText = ((TextMessage) responseMessage).getText(); 
      log.info("Client: Got response [" + responseText + "]");
      return responseText;
    } catch (JMSException e) {
      log.error("JMS Exception on client", e);
      return null;
    }
  }
}

Asynchronous With Callback

Like the Fire and Forget pattern, this is asynchronous, but unlike it, the client can specify that some action be triggered when the server is done processing the request. A real-life analog for this case would be mailing a letter with a return receipt. The example I use here has an acknowledgement sent back to the client, which the client stores in a local database (an in-memory HashMap in this case). The caller sends a message to the server through the client and gets back the message id. The caller can then check on the status of the message.

The request and response messages are tied together using the JMS Correlation Id. We generate a UUID when the message is sent and put it in the JMS Correlation Id. If the server finds that the incoming message contains a JMS Correlation Id, it passes it through in its response.
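Stripped of the JMS machinery, the pass-through can be sketched in a few lines of Python (a toy where the asynchronous round trip is collapsed into a direct function call; the dict-based message shape is my invention, not the JMS API):

```python
import uuid

message_status = {}  # client-side store of acknowledgements

def server_process(request):
    """Like the server: echo the correlation id back on the response."""
    return {"correlation_id": request["correlation_id"],
            "body": "Processed " + request["body"]}

def send(body):
    """Like the client: stamp a UUID on the request, file the response by it."""
    correlation_id = str(uuid.uuid4())
    response = server_process({"correlation_id": correlation_id, "body": body})
    message_status[response["correlation_id"]] = response["body"]
    return correlation_id

msg_id = send("hello")
print(message_status[msg_id])  # -> Processed hello
```

The key point is that the client never waits on a particular response; the correlation id is what lets it match acknowledgements to requests after the fact.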

// AsynchronousWithCallbackClient.java
package com.mycompany.mqpatterns;

import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

import javax.jms.Connection;
import javax.jms.DeliveryMode;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.MessageListener;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.jms.Queue;

import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class AsynchronousWithCallbackClient implements MessageListener {

  private final Log log = LogFactory.getLog(getClass());
  
  private Connection connection;
  private Session session;
  private MessageProducer producer;
  private MessageConsumer consumer;

  // to store acknowledgements as they are received. Should be replaced 
  // by a more persistent mechanism
  private Map<String,String> messageStatus = new HashMap<String,String>();
  
  public void init() throws Exception {
    // set 'er up
    ActiveMQConnectionFactory connectionFactory = 
      new ActiveMQConnectionFactory("tcp://localhost:61616");
    connection = connectionFactory.createConnection();
    session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
    // create our request and response queues (point-to-point, so Queue not Topic)
    Queue request = session.createQueue("request.queue");
    Queue response = session.createQueue("response.queue");
    // and attach a consumer and producer to them
    producer = session.createProducer(request);
    producer.setDeliveryMode(DeliveryMode.NON_PERSISTENT);
    consumer = session.createConsumer(response);
    consumer.setMessageListener(this);
    // and start your engines...
    connection.start();
  }

  public void destroy() throws Exception {
    session.close();
    connection.close();
  }
  
  public String sendMessage(String messageText) throws Exception {
    TextMessage message = session.createTextMessage(messageText);
    String messageId = UUID.randomUUID().toString();
    message.setJMSCorrelationID(messageId);
    producer.send(message);
    return messageId;
  }

  public String getStatus(String correlationId) {
    synchronized(this) {
      if (messageStatus.containsKey(correlationId)) {
        String status = messageStatus.get(correlationId);
        messageStatus.remove(correlationId);
        return status;
      } else {
        return null;
      }
    }
  }
  
  public void onMessage(Message message) {
    synchronized(this) {
      try {
        if (message instanceof TextMessage) {
          String originalMessageId = message.getJMSCorrelationID();
          String responseText = ((TextMessage) message).getText();
          messageStatus.put(originalMessageId, responseText);
        }
      } catch (JMSException e) {
        log.error("JMS Exception encountered on client", e);
      }
    }
  }
}

Test Harness

To test these clients, I set up a JUnit test as follows.

// JmsClientPatternsTest.java
package com.mycompany.mqpatterns;

import java.util.ArrayList;
import java.util.List;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.junit.After;
import org.junit.AfterClass;
import org.junit.Before;
import org.junit.BeforeClass;
import org.junit.Test;

public class JmsClientPatternsTest {

  private final Log log = LogFactory.getLog(getClass());
  
  private static MessageServer server;
  private static String[] messages = StringUtils.split(
    "The quick brown fox jumped over the lazy dog");

  private FireAndForgetClient fireAndForgetClient;
  private PseudoSynchronousClient pseudoSynchronousClient;
  private AsynchronousWithCallbackClient asyncWithCallbackClient;
  
  @BeforeClass
  public static void setUpBeforeClass() throws Exception {
    server = new MessageServer();
    server.init();
  }

  @AfterClass
  public static void tearDownAfterClass() throws Exception {
    server.destroy();
  }

  @Before
  public void setUp() throws Exception {
    fireAndForgetClient = new FireAndForgetClient();
    fireAndForgetClient.init();
    pseudoSynchronousClient = new PseudoSynchronousClient();
    pseudoSynchronousClient.init();
    asyncWithCallbackClient = new AsynchronousWithCallbackClient();
    asyncWithCallbackClient.init();
  }

  @After
  public void tearDown() throws Exception {
    fireAndForgetClient.destroy();
    pseudoSynchronousClient.destroy();
    asyncWithCallbackClient.destroy();
  }

  @Test
  public void testFireAndForgetClient() throws Exception {
    for (String message : messages) {
      fireAndForgetClient.sendMessage(message);
    }
  }
  
  @Test
  public void testPseudoSynchronousClient() throws Exception {
    for (String message : messages) {
      String response = pseudoSynchronousClient.sendMessage(message);
      log.debug("response=" + response);
    }
  }

  @Test
  public void testAsynchronousWithCallbackClient() throws Exception {
    List<String> sentMessageIds = new ArrayList<String>();
    for (String message : messages) {
      String messageId = asyncWithCallbackClient.sendMessage(message);
      sentMessageIds.add(messageId);
    }
    Thread.sleep(2000);
    for (String sentMessageId : sentMessageIds) {
      String response = asyncWithCallbackClient.getStatus(sentMessageId);
      log.debug("response[" + sentMessageId + "]=" + response);
    }
  }
}

Conclusion

Compared to what I remember of working with the Java API for MQ-Series, JMS is easier to work with. The API provides an onMessage(Message) strategy interface method to be overridden in both the client and the server. With the Java API to MQ-Series, the behavior of tying a listener process to a queue had to be configured in the server. Also, the queues had to be manually defined before being accessed from client code. It is possible that enterprise setups would follow the same strategy; I don't know enough about either MQ-Series or ActiveMQ administration to comment intelligently on that.

Obviously there is much more to JMS than this, but the JMS API is fairly simple and compact and can be easily learned. My take on the complexity of the JMS API is that it is similar to JDBC. You need to learn a few basic things to be able to start coding immediately, and the rest you can learn as you go. I found the simplicity of the API to be very attractive.

However, more than the JMS API itself, what I really like is the idea of neatly decoupling two components by putting a pair of queues in between. There are many applications that inherently lend themselves to this sort of design, and these applications typically scale better than those which don't.
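The same pair-of-queues decoupling can be sketched outside JMS entirely, using only the Python standard library (a toy in-process analog of the setup above, not a JMS client):

```python
import queue
import threading

request_q = queue.Queue()
response_q = queue.Queue()

def server():
    """Loop forever: read a request, process it, write a response."""
    while True:
        text = request_q.get()
        if text is None:          # shutdown sentinel
            break
        response_q.put("Processed " + text)

worker = threading.Thread(target=server, daemon=True)
worker.start()

# the client knows nothing about the server except the two queues
request_q.put("order 1")
request_q.put("order 2")
print(response_q.get())  # -> Processed order 1
print(response_q.get())  # -> Processed order 2
request_q.put(None)
worker.join()
```

Neither side holds a reference to the other; each can be scaled, restarted, or replaced independently, which is exactly what a message broker gives you across process and machine boundaries.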

Often, people get the design right, but end up implementing a home-grown thread-based solution for the application. Since the average Java programmer is generally not very well versed in thread programming, this can lead to a maintenance bottleneck, with only one or two people in a group who know enough to debug and fix problems in the code. JMS provides the tools by which such designs can be implemented with no thread-based application code.

Thus, for a Java programmer, I think there is a high benefit-to-cost ratio in favor of learning JMS. Not only is it easy to learn, but it opens your eyes to patterns which take advantage of inherent asynchronicity in an application, and provides you with the tools to implement these patterns very easily.

Saturday, December 01, 2007

Spelling Checker using Lucene

My initial interest in spell checking algorithms started when I had to fix some bugs in the spell checking code at work over a year ago. We use Jazzy, a Java implementation of GNU Aspell, as our spell checking library. Jazzy uses a combination of Metaphone and Levenshtein distance (aka Edit distance) to match a misspelled word to a set of words in its dictionary.
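Edit distance itself is compact; a textbook dynamic-programming version (my own sketch, not Jazzy's implementation) looks like this:

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

print(levenshtein("kitten", "sitting"))  # -> 3
```

A checker like Jazzy first narrows the dictionary down by phonetic code (Metaphone), then ranks the surviving candidates by a distance such as this.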

An alternative approach to spell checking is the use of n-grams. Basically, you break each dictionary word into sequences of characters of size n, moving your pointer forward one character at a time, and store these in an index. When the user enters a misspelled word, you do the same thing with the input word, then match the n-grams generated against the n-grams in your dictionary. The idea is that a misspelled word will typically have only one or two characters misspelled, transposed or missing, so by comparing the n-grams, you get matches to correctly spelled words that are close to it. This is a newer approach to spell checking, which has become feasible with the availability of open source search libraries such as Lucene, since it requires the ability to tokenize your dictionary words and store them in an index for quick retrieval.
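The n-gram idea itself fits in a few lines (a hand-rolled sketch, not the Lucene implementation):

```python
def ngrams(word, n):
    """All length-n character sequences of word, one position at a time."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def similarity(a, b, n=3):
    """Fraction of a's n-grams that also occur in b (a crude match score)."""
    ga, gb = set(ngrams(a, n)), set(ngrams(b, n))
    return len(ga & gb) / len(ga) if ga else 0.0

print(ngrams("cancer", 3))              # -> ['can', 'anc', 'nce', 'cer']
print(similarity("cancre", "cancer"))   # -> 0.5
```

A search library turns the second function inside out: instead of scoring every dictionary word, it looks the input's n-grams up in an inverted index and lets the ranking fall out of the term matches.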

I read about this approach first on Tom White's article "Did you mean Lucene?" on java.net quite some time back, but I never completely understood it then, probably because I was new to Lucene. However, what I found attractive about it was its compactness and performance.

You see, the spell checking on most search sites is surfaced as a one line "Did you mean..." component at the top of your search results. Compared to the actual search results, it's pretty much irrelevant, unless its suggestions are way off the mark, at which point it has entertainment value to the user and embarrassment value to the site serving the search results. So you want spell checking code that's quick to maintain and execute, even though its internals are likely to be much more complex than the code serving up the search results.

And besides, as a search company heavily invested in Lucene, it was probably more appropriate for us to be using a Lucene based approach, if one existed and was on par, performance and results-wise, with other approaches.

I recently came across this wiki page on the jakarta-lucene site, where the n-gram approach is more succinctly described. You can download the code implementing this algorithm from the original Bugzilla submission. It is written by Nicolas Maisonneuve (and inspired by David Spencer's code according to the class Javadocs). However, the code is written against the Lucene 1.x API and I was using Lucene 2.2, so it would not compile for me.

However, the algorithm is quite simple and intuitive. Basically, each dictionary word is tokenized and stored in an index with the following fields.

Field    Description                                            Example
------   ----------------------------------------------------   ------------------
word     The dictionary word                                    cancer
gram3    Multi-field of all 3-grams for the word                can, anc, nce, cer
gram4    Multi-field of all 4-grams for the word                canc, ance, ncer
start3   The first 3 characters of the word, for edge boosting  can
end3     The last 3 characters of the word, for edge boosting   cer
start4   The first 4 characters of the word, for edge boosting  canc
end4     The last 4 characters of the word, for edge boosting   ncer

The search query is a BooleanQuery consisting of all the fields (except word) of the index, OR'd together. If edge boosting is desired (on the assumption that people make fewer mistakes at the start and end of words), then the start* and end* field queries are given a higher boost in the query.

Since I had the algorithm anyway, I decided to rewrite the code using Nicolas's code as a guide. For one thing, quite some time has passed since he wrote it, and n-gram tokenizers are now available from the Lucene sandbox, so I could simply grab one instead of having to write my own. Second, he had implemented a custom version of edit distance, where I preferred using the getLevenshteinDistance() method from the Jakarta commons-lang project. Third, his method of loading up a dictionary was from an existing index, whereas I wanted something more generic. For these reasons, I built my own Lucene-based spell checker, the code for which is shown below.

I first built an NGramAnalyzer using the NGramTokenizer provided by the Lucene Analyzers sandbox project.

// NGramAnalyzer.java
package com.mycompany.spellcheck;

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenizer;

public class NGramAnalyzer extends Analyzer {

  private int minGram;
  private int maxGram;
  
  public NGramAnalyzer(int minGram, int maxGram) {
    this.minGram = minGram;
    this.maxGram = maxGram;
  }
  
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new NGramTokenizer(reader, minGram, maxGram);
  }
}

Next I built a SpellChecker which exposes two methods: addToDictionary() and suggestSimilar(). The first allows you to add words to the spelling dictionary (which is a Lucene index with the fields described in the SpellChecker wiki page), and the second returns a List of spelling suggestions for a given word. In addition (unlike Nicolas's code), it allows results that have Soundex equivalence, since sometimes the user's spelling may be wildly off the mark, but he knows what the word sounds like, and we want to catch that too.

// LuceneSpellChecker.java
package com.mycompany.spellcheck;

import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.commons.codec.EncoderException;
import org.apache.commons.codec.language.Soundex;
import org.apache.commons.lang.StringUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.store.FSDirectory;

public class LuceneSpellChecker {
  
  private String spellIndexDir;
  private boolean boostEdges;
  private int editDistanceCutoff = 3;
  private int soundexDiffCutoff = 4;
  
  private IndexWriter spellIndexWriter;
  private IndexSearcher spellIndexSearcher;
  
  private class AnalysisResult {
    public String input;
    public List<String> gram3s = new ArrayList<String>();
    public List<String> gram4s = new ArrayList<String>();
    public String start3 = "";
    public String start4 = "";
    public String end3 = "";
    public String end4 = "";
  }
  
  public void setBoostEdges(boolean boostEdges) {
    this.boostEdges = boostEdges;
  }
  
  public void setEditDistanceCutoff(int editDistanceCutoff) {
    this.editDistanceCutoff = editDistanceCutoff;
  }
  
  public void setSoundexDiffCutoff(int soundexDiffCutoff) {
    this.soundexDiffCutoff = soundexDiffCutoff;
  }
  
  public void setSpellIndexDir(String spellIndexDir) {
    this.spellIndexDir = spellIndexDir;
  }
  
  public void init() throws Exception {
    File spellIndexDirFile = new File(spellIndexDir);
    this.spellIndexWriter = new IndexWriter(
      FSDirectory.getDirectory(spellIndexDirFile), new StandardAnalyzer());
    this.spellIndexSearcher = new IndexSearcher(spellIndexDir);
  }
  
  public void destroy() throws Exception {
    spellIndexWriter.close();
    spellIndexSearcher.close();
  }

  public void flush() throws Exception {
    spellIndexWriter.optimize();
  }
  
  public void addToDictionary(String dictionaryEntry) throws IOException {
    AnalysisResult result = analyze(dictionaryEntry);
    Document doc = new Document();
    doc.add(new Field("word", result.input, Store.YES, Index.TOKENIZED));
    for (String gram3 : result.gram3s) {
      doc.add(new Field("gram3", gram3, Store.YES, Index.TOKENIZED));
    }
    for (String gram4 : result.gram4s) {
      doc.add(new Field("gram4", gram4, Store.YES, Index.TOKENIZED));
    }
    doc.add(new Field("start3", result.start3, Store.YES, Index.TOKENIZED));
    doc.add(new Field("start4", result.start4, Store.YES, Index.TOKENIZED));
    doc.add(new Field("end3", result.end3, Store.YES, Index.TOKENIZED));
    doc.add(new Field("end4", result.end4, Store.YES, Index.TOKENIZED));
    spellIndexWriter.addDocument(doc);
  }
  
  public List<String> suggestSimilar(String input, int maxSuggestions) 
      throws IOException, EncoderException {
    AnalysisResult result = analyze(input);
    BooleanQuery query = new BooleanQuery();
    addGramQuery(query, "gram3", result.gram3s);
    addGramQuery(query, "gram4", result.gram4s);
    addEdgeQuery(query, "start3", result.start3);
    addEdgeQuery(query, "end3", result.end3);
    addEdgeQuery(query, "start4", result.start4);
    addEdgeQuery(query, "end4", result.end4);
    Set<String> words = new HashSet<String>();
    Hits hits = spellIndexSearcher.search(query);
    int numHits = hits.length();
    for (int i = 0; i < numHits; i++) {
      Document doc = hits.doc(i);
      String suggestion = doc.get("word");
      // if the suggestion is the same as the input, ignore it
      if (suggestion.equalsIgnoreCase(input)) {
        continue;
      }
      // if the suggestion is within the specified levenshtein's distance, include it
      if (StringUtils.getLevenshteinDistance(input, suggestion) < editDistanceCutoff) {
        words.add(suggestion);
      }
      // if they sound the same, include it
      Soundex soundex = new Soundex();
      if (soundex.difference(input, suggestion) >= soundexDiffCutoff) {
        words.add(suggestion);
      }
    }
    List<String> wordlist = new ArrayList<String>();
    wordlist.addAll(words);
    return wordlist.subList(0, Math.min(maxSuggestions, wordlist.size()));
  }

  private AnalysisResult analyze(String input) throws IOException {
    AnalysisResult result = new AnalysisResult();
    result.input = input;
    Analyzer analyzer = new NGramAnalyzer(3, 4);
    TokenStream tokenStream = analyzer.tokenStream("dummy", new StringReader(input));
    Token token;
    while ((token = tokenStream.next()) != null) {
      String text = token.termText();
      if (text.length() == 3) {
        result.gram3s.add(text);
      } else if (text.length() == 4) {
        result.gram4s.add(text);
      }
    }
    result.start3 = input.substring(0, Math.min(input.length(), 3));
    result.start4 = input.substring(0, Math.min(input.length(), 4));
    result.end3 = input.substring(Math.max(0, input.length() - 3), input.length());
    result.end4 = input.substring(Math.max(0, input.length() - 4), input.length());
    return result;
  }
  
  private void addGramQuery(BooleanQuery query, String fieldName, List<String> grams) {
    for (String gram : grams) {
      query.add(new TermQuery(new Term(fieldName, gram)), Occur.SHOULD);
    }
  }
  
  private void addEdgeQuery(BooleanQuery query, String fieldName, String fieldValue) {
    TermQuery start3Query = new TermQuery(new Term(fieldName, fieldValue));
    if (boostEdges) {
      start3Query.setBoost(2.0F);
    }
    query.add(start3Query, Occur.SHOULD);
  }
}

Finally, here is the JUnit test I used to exercise the code above, which also shows how a client would call the SpellChecker. The test reads a file of misspelled-word/correct-spelling pairs. The first test method adds the correct words to the index, and the second looks up suggestions for the misspelled ones. The SpellChecker is instantiated directly in the test, but could also be obtained from a Spring BeanFactory, since the class is set up for setter injection.

// LuceneSpellCheckerTest.java
package com.mycompany.spellcheck;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.List;

import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.junit.After;
import org.junit.Before;
import org.junit.BeforeClass;
import org.junit.Test;

public class LuceneSpellCheckerTest {

  private static final String INPUT_FILE = "src/main/resources/my_tests.txt";
  private static final String INDEX_DIR = "/tmp/spellindex";
  private static final Log LOG = LogFactory.getLog(LuceneSpellCheckerTest.class);

  private LuceneSpellChecker spellChecker;
  
  @BeforeClass
  public static void setUpBeforeClass() throws Exception {
    File indexDirFile = new File(INDEX_DIR);
    if (indexDirFile.exists()) {
      FileUtils.forceDelete(new File(INDEX_DIR));
    }
  }
  
  @Before
  public void setUp() throws Exception {
    spellChecker = new LuceneSpellChecker();
    spellChecker.setSpellIndexDir(INDEX_DIR);
    spellChecker.setBoostEdges(true);
    spellChecker.setEditDistanceCutoff(3);
    spellChecker.setSoundexDiffCutoff(4);
    spellChecker.init();
  }
  
  @After
  public void tearDown() throws Exception {
    spellChecker.destroy();
  }
  
  @Test
  public void testIndexing() throws Exception {
    BufferedReader reader = new BufferedReader(
      new InputStreamReader(new FileInputStream(INPUT_FILE)));
    String line;
    while ((line = reader.readLine()) != null) {
      if (line.startsWith("#")) {
        continue;
      }
      String[] parts = StringUtils.split(line, "=");
      if (parts.length == 2) {
        String correctSpelling = parts[1];
        if (StringUtils.isNotBlank(correctSpelling)) {
          LOG.debug("Adding to index:" + correctSpelling);
          spellChecker.addToDictionary(correctSpelling);
        }
      }
    }
    spellChecker.flush();
    reader.close();
  }

  @Test
  public void testSearching() throws Exception {
    BufferedReader reader = new BufferedReader(
      new InputStreamReader(new FileInputStream(INPUT_FILE)));
    String line;
    while ((line = reader.readLine()) != null) {
      if (line.startsWith("#")) {
        continue;
      }
      String wrongSpelling = StringUtils.split(line, "=")[0];
      List<String> suggestions = spellChecker.suggestSimilar(wrongSpelling, 2);
      LOG.debug("suggestion[" + wrongSpelling + "]=" + suggestions);
    }
    reader.close();
  }
}

The above code is not as complete as Nicolas's, since it does not have the code to do popularity based boosting in suggestSimilar(). Essentially, this means that if the spell checker has two or more suggestions, the caller can specify that the suggestions be returned in order of decreasing popularity in the index. Internally, the code would call docFreq() for each of the suggestions and reorder accordingly.
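If I were to add that, the reordering itself would look something like the sketch below. The Map stands in for the docFreq() calls against the index, and the words and counts are invented for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of popularity-based reordering of suggestions. The docFreq Map
// stands in for IndexReader.docFreq(new Term("word", suggestion)) calls;
// the words and counts below are made up for illustration.
public class PopularitySort {

  static List<String> rankByPopularity(List<String> suggestions,
      final Map<String, Integer> docFreq) {
    List<String> ranked = new ArrayList<String>(suggestions);
    Collections.sort(ranked, new Comparator<String>() {
      public int compare(String a, String b) {
        return docFreq.get(b) - docFreq.get(a); // descending frequency
      }
    });
    return ranked;
  }

  public static void main(String[] args) {
    Map<String, Integer> freqs = new HashMap<String, Integer>();
    freqs.put("cancer", 42);
    freqs.put("canker", 7);
    freqs.put("candor", 3);
    System.out.println(rankByPopularity(
        Arrays.asList("candor", "canker", "cancer"), freqs));
    // [cancer, canker, candor]
  }
}
```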

From my minimal testing, it appears that multi-word dictionary entries are handled as well as single-word entries, but I cannot say for sure. If that is true, the popularity functionality would make sense as described. With Jazzy this is not the case, so the application has to figure out the "best" multi-word suggestion from a cartesian join of all the single-word suggestions. If the same turns out to be true of this approach, it may make more sense to apply the popularity metric to those combinations of single-word suggestions instead.