Saturday, February 23, 2008

Implementing Inheritance in database with JPA

This post is the result of a casual conversation with one of our engineers. About a couple of months ago, he mentioned that if we ever got around to refactoring some code in one of our systems, JPA would be preferable to the straight JDBC we currently use for database persistence. I did not know much about JPA at the time, except that it had something to do with EJB3 and Hibernate, so I decided to read up on it.

Turns out I was only partially correct. JPA is an API implemented by various popular ORMs such as TopLink from Oracle, Kodo from BEA, Hibernate from Red Hat's JBoss group, and OpenJPA from Apache. In a sense, JPA is to ORMs what JDBC was to databases. It provides a common API to work against multiple ORMs, so developers need to learn one API to work against any JPA-compliant ORM, and a company could (at least in theory) switch between ORM providers without changing any source code. Based on previous experience coding against Hibernate 2.x, I think JPA code (using Hibernate) also looks much simpler.

One book I found very helpful was Chris Maki's "JPA 101: Java Persistence Explained" from Sourcebeat. It's available as a reasonably priced PDF eBook and contains almost no fluff, unless you count the first chapter, which walks through setting up the example application with Eclipse and Maven, tools most (but not all) developers will already be familiar with. The book has plenty of examples and is very readable; I would strongly recommend it if you are trying to get started with JPA.

After reading the book, I realized that the engineer's suggestion was pretty much spot-on. In particular, I liked the concept of implementing the application's object inheritance hierarchy directly in the database, which I describe below.

We license content from various providers. All content has certain metadata that we always extract, such as title and summary, and we always assign each article a unique URL on our website. However, each provider is different, and some may supply additional metadata that is unique to them. As an example, consider two data sources, one called Magazine and one called Book. The object UML would look something like this:
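ModelBase (abstract, @MappedSuperclass: id)
  +-- Article (abstract: articleId, title, summary, url)
        +-- MagazineArticle (publicationName, publicationDate, authorName)
        +-- BookArticle (authorName, publisherName, isbnNumber)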

The ModelBase class is needed by JPA, and it's convenient to set up a single one that enforces the correct id type (for the given database) for every persistable bean in the application. The Article class specifies the properties we always extract, regardless of provider, and MagazineArticle and BookArticle specify the metadata unique to each provider.

JPA allows three different inheritance strategies, which most providers implement. I chose the JOINED strategy, where the properties common to all subclasses are stored in a master table and the properties unique to each Article subclass are stored in its own table, linked back to the master using the autogenerated surrogate id as the foreign key. This has the advantage of being quite normalized, and if the inheritance structure is relatively flat (mine is only one level deep), the performance overhead of the joins is minimal. The corresponding database structure for the JOINED subclass strategy would look like this:
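Article (id PK, articleId, title, summary, url)
MagazineArticle (id PK/FK -> Article.id, publicationName, publicationDate, authorName)
BookArticle (id PK/FK -> Article.id, authorName, publisherName, isbnNumber)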

Notice the absence of the discriminator column in the Article table above. I spent nearly a day trying to figure out what I was doing wrong before I found that Hibernate, unlike other JPA implementations, does not need a discriminator column for JOINED inheritance, and apparently has no plans to conform to a part of the standard it considers broken. This does not affect me much, since I plan on using Hibernate anyway, but while I am no expert on these things, I think this stance may harm Hibernate's adoption by big companies for whom JPA compliance is a high priority. The least that should be done, IMO, is to adequately document this aberration so other developers are not tripped up like I was.

But anyway, on to the code. Since I was using MySQL, I decided to build a ModelBase object, annotated with @MappedSuperclass, which specifies the id type and generation strategy. JPA can work with legacy ids, but they are considerably more work to implement than autogenerated surrogate keys, so I decided to keep things simple. In any case, if we decide to switch to some other database, all we would need to do is change the id generation strategy in this one class (and the provider settings in the persistence.xml file).

// ModelBase.java
package com.mycompany.myapp.persistence;

import java.io.Serializable;

import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.MappedSuperclass;

@MappedSuperclass
public abstract class ModelBase implements Serializable {

  @Id 
  @GeneratedValue(strategy=GenerationType.AUTO)
  private Long id;
  
  public Long getId() {
    return id;
  }
  
  public void setId(Long id) {
    this.id = id;
  }
}

The above class is marked as @MappedSuperclass, so there is no corresponding table in the database. The next class is the main Article class, also abstract, since we never want to use it as is; for each content provider, we want to add extra metadata unique to that provider.

// Article.java
package com.mycompany.myapp.persistence;

import javax.persistence.DiscriminatorColumn;
import javax.persistence.DiscriminatorType;
import javax.persistence.Entity;
import javax.persistence.Inheritance;
import javax.persistence.InheritanceType;

@Entity
@Inheritance(strategy=InheritanceType.JOINED)
@DiscriminatorColumn(discriminatorType=DiscriminatorType.INTEGER, name="articleTypeId")
public abstract class Article extends ModelBase {

  private String articleId;
  private String title;
  private String summary;
  private String url;
  
  public String getArticleId() {
    return articleId;
  }
  
  public void setArticleId(String articleId) {
    this.articleId = articleId;
  }
  
  public String getTitle() {
    return title;
  }
  
  public void setTitle(String title) {
    this.title = title;
  }
  
  public String getSummary() {
    return summary;
  }
  
  public void setSummary(String summary) {
    this.summary = summary;
  }
  
  public String getUrl() {
    return url;
  }
  
  public void setUrl(String url) {
    this.url = url;
  }
}

Notice that we have the @DiscriminatorColumn annotation. With Hibernate, this has absolutely no effect; in fact, you don't even need it for InheritanceType.JOINED. The only time Hibernate needs it is for single-table inheritance.
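For contrast, here is a minimal sketch (my own illustration, not part of the original example) of a single-table mapping, where the discriminator genuinely matters because all subclasses share one table:

// SingleTableArticle.java (hypothetical illustration)
package com.mycompany.myapp.persistence;

import javax.persistence.DiscriminatorColumn;
import javax.persistence.DiscriminatorType;
import javax.persistence.DiscriminatorValue;
import javax.persistence.Entity;
import javax.persistence.Inheritance;
import javax.persistence.InheritanceType;

@Entity
@Inheritance(strategy=InheritanceType.SINGLE_TABLE)
@DiscriminatorColumn(discriminatorType=DiscriminatorType.INTEGER, name="articleTypeId")
public abstract class SingleTableArticle extends ModelBase {

  // common columns live in the one shared table
  private String title;

  public String getTitle() { return title; }
  public void setTitle(String title) { this.title = title; }
}

@Entity
@DiscriminatorValue("1")
class SingleTableMagazineArticle extends SingleTableArticle {

  // subclass columns become nullable columns in the same shared table,
  // and the discriminator is what tells the rows apart
  private String publicationName;

  public String getPublicationName() { return publicationName; }
  public void setPublicationName(String publicationName) { this.publicationName = publicationName; }
}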

The subclasses of Article are shown below. As before, we don't strictly need the @DiscriminatorValue annotation in either subclass, since Hibernate will neither use nor record it.

// MagazineArticle.java
package com.mycompany.myapp.persistence;

import java.util.Date;

import javax.persistence.DiscriminatorValue;
import javax.persistence.Entity;
import javax.persistence.Temporal;
import javax.persistence.TemporalType;

@Entity
@DiscriminatorValue("1")
public class MagazineArticle extends Article {

  private static final long serialVersionUID = 4276734517833727032L;

  private String publicationName;
  
  @Temporal(TemporalType.DATE)
  private Date publicationDate;
  
  private String authorName;
  
  public String getPublicationName() {
    return publicationName;
  }
  
  public void setPublicationName(String publicationName) {
    this.publicationName = publicationName;
  }
  
  public Date getPublicationDate() {
    return publicationDate;
  }
  
  public void setPublicationDate(Date publicationDate) {
    this.publicationDate = publicationDate;
  }
  
  public String getAuthorName() {
    return authorName;
  }

  public void setAuthorName(String authorName) {
    this.authorName = authorName;
  }
}

// BookArticle.java
package com.mycompany.myapp.persistence;

import javax.persistence.DiscriminatorValue;
import javax.persistence.Entity;

@Entity
@DiscriminatorValue("2")
public class BookArticle extends Article {

  private static final long serialVersionUID = -2274023497279749079L;
  
  private String authorName;
  private String publisherName;
  private String isbnNumber;
  
  public String getAuthorName() {
    return authorName;
  }
  
  public void setAuthorName(String authorName) {
    this.authorName = authorName;
  }
  
  public String getPublisherName() {
    return publisherName;
  }
  
  public void setPublisherName(String publisherName) {
    this.publisherName = publisherName;
  }
  
  public String getIsbnNumber() {
    return isbnNumber;
  }
  
  public void setIsbnNumber(String isbnNumber) {
    this.isbnNumber = isbnNumber;
  }
}

Finally, we need to set up the database. I created a database and populated it with the tables shown in the database structure above. Then, to link the code and the database together, we create a persistence.xml file in the src/main/resources/META-INF directory, like so:

<?xml version="1.0" encoding="UTF-8"?>
<persistence xmlns="http://java.sun.com/xml/ns/persistence"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://java.sun.com/xml/ns/persistence 
    http://java.sun.com/xml/ns/persistence/persistence_1_0.xsd"
    version="1.0">
  <persistence-unit name="myapp" transaction-type="RESOURCE_LOCAL">
    <provider>org.hibernate.ejb.HibernatePersistence</provider>
    <class>com.mycompany.myapp.persistence.ModelBase</class>
    <class>com.mycompany.myapp.persistence.Article</class>
    <!-- put your article subclasses here -->
    <class>com.mycompany.myapp.persistence.MagazineArticle</class>
    <class>com.mycompany.myapp.persistence.BookArticle</class>
    <properties>
      <property name="hibernate.connection.driver_class" 
        value="com.mysql.jdbc.Driver"/>
      <property name="hibernate.connection.url" 
        value="jdbc:mysql://localhost:3306/contentdb"/>
      <property name="hibernate.connection.username" value="jpauser" />
      <property name="hibernate.connection.password" value="jpauser"/>
      <property name="hibernate.dialect" 
        value="org.hibernate.dialect.MySQLDialect"/>
      <property name="hibernate.cache.provider_class" 
        value="org.hibernate.cache.HashtableCacheProvider"/>
      <property name="hibernate.show_sql" value="true"/>
    </properties>
  </persistence-unit>
</persistence>

Here is a JUnit test to insert data into this database structure. Most people would probably use DBUnit for this, but my objective was to find out how to insert data using JPA, so I wrote a unit test. Notice how the code is almost entirely unaware of the underlying database structure: it deals with Java objects, and the JPA EntityManager does the work of creating and executing the SQL.

// ArticlePersistenceTest.java
package com.mycompany.myapp.persistence;

import java.util.Calendar;
import java.util.List;

import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;
import javax.persistence.Query;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class ArticlePersistenceTest {

  private final Log log = LogFactory.getLog(getClass());
  
  private EntityManager entityManager;
  private EntityManagerFactory entityManagerFactory;
  
  @Before
  public void setUp() throws Exception {
    entityManagerFactory = Persistence.createEntityManagerFactory("myapp");
    entityManager = entityManagerFactory.createEntityManager();
  }

  @After
  public void tearDown() throws Exception {
    entityManager.close();
    entityManagerFactory.close();
  }

  @Test
  public void testPersistAdamArticle() throws Exception {

    // build and persist a magazine article
    MagazineArticle ma = new MagazineArticle();
    ma.setArticleId("mag-000-001");
    ma.setTitle("Magazine Article Title 1");
    ma.setSummary("This is a short summary of magazine article 0001...");
    ma.setUrl("/path/to/mag-art-0001");
    Calendar pubDateCalendar = Calendar.getInstance();
    pubDateCalendar.set(2002, 11, 15); 
    ma.setPublicationDate(pubDateCalendar.getTime());
    ma.setPublicationName("Harper Collins");
    ma.setAuthorName("Dr Doolittle");

    entityManager.getTransaction().begin();
    entityManager.persist(ma);
    entityManager.getTransaction().commit();

    // build and persist a book article
    BookArticle ba = new BookArticle();
    ba.setArticleId("bk-000-001");
    ba.setTitle("Book Article Title 1");
    ba.setSummary("This is a short summary of book article 0001...");
    ba.setUrl("/path/to/book-art-0001");
    ba.setAuthorName("Dr Busybody");
    ba.setPublisherName("Tom Collins");
    ba.setIsbnNumber("1234-5678");

    entityManager.getTransaction().begin();
    entityManager.persist(ba);
    entityManager.getTransaction().commit();
    
    // select all articles
    Query q = entityManager.createQuery("select a from Article a");
    List<Article> results = q.getResultList();
    for (Article result : results) {
      log.debug("result=" + result.toString());
    }
  }
}

I purposely kept the code as free of override annotations as possible, which may not be feasible in real life. For example, your DBA may enforce a particular table or column naming convention. You can map beans to specific table names using the @Table annotation, and properties to specific column names using the @Column annotation.
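For illustration, here is a sketch of what such overrides might look like on the Article class; the table and column names are made up, standing in for whatever the DBA mandates:

// Article.java (hypothetical naming-override variant, for illustration only)
package com.mycompany.myapp.persistence;

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Inheritance;
import javax.persistence.InheritanceType;
import javax.persistence.Table;

@Entity
@Table(name="CONTENT_ARTICLE")  // made-up DBA-mandated table name
@Inheritance(strategy=InheritanceType.JOINED)
public abstract class Article extends ModelBase {

  // made-up DBA-mandated column name; nullable=false also makes the column
  // mandatory for every subclass, since it lives in the JOINED master table
  @Column(name="ARTICLE_TITLE", nullable=false)
  private String title;

  public String getTitle() { return title; }
  public void setTitle(String title) { this.title = title; }
}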

Another thing I noticed was that the JPA code is slightly slower than straight JDBC calls. This is expected, since JPA provides a level of abstraction that allows us to write more readable code, and does some generic heavy lifting behind the scenes that is conceivably less efficient than hand-crafted SQL. I think this becomes less noticeable when the application runs over longer periods and can take advantage of the ORM's cache.

Overall, I was quite impressed with JPA. The JOINED subclass strategy is conceptually nicer than the table-per-class strategy we currently implement using straight JDBC. With a JOINED strategy, we can enforce that certain fields are populated regardless of provider (for example, by marking common columns nullable=false, as in the sketch above). It is also normalized, with no repetition of columns across individual tables. With separate tables, the implementor of a new table often uses different column names or types, which makes it harder to work with the articles in a generic way on the front end.

As for the learning curve involved with JPA, obviously there is one, but I foresee that JPA will soon be as ubiquitous as JDBC is today. More and more Java shops are switching over to ORMs, and there are plenty of free and open-source products available that are as good as their commercial counterparts. Learning the JPA API will let you work with any JPA-compliant ORM out there, and since the mapping is annotation-driven, it's just a matter of learning a few simple annotations to get going with JPA.

Sunday, February 17, 2008

A Generic BerkeleyDB store using DPL

I have written before about how much I liked the annotation-driven persistence mechanism that BerkeleyDB Java Edition provides through its Direct Persistence Layer (DPL). I had an opportunity to look at it once more this weekend, this time with a view to persisting arbitrary objects into Maps keyed by a unique String value.

The objects to be persisted are arbitrary in the sense that the caller of the persistence code knows exactly what objects need to be persisted, and persists the same class of objects into a given BerkeleyDB store; the code that does the persisting, however, does not know what objects it is working with until it is instantiated by the caller. To support this, we define a generic StoreEntity object that holds a value of type V.

// StoreEntity.java
package com.mycompany.bdb;

import com.sleepycat.persist.model.Entity;
import com.sleepycat.persist.model.PrimaryKey;

@Entity
public class StoreEntity<V> {

  @PrimaryKey private String key;
  private V value;
  
  public StoreEntity() {
    super();
  }
  
  public String getKey() {
    return key;
  }
  
  public void setKey(String key) {
    this.key = key;
  }
  
  public V getValue() {
    return value;
  }
  
  public void setValue(V value) {
    this.value = value;
  }
}

The StoreEntity objects are persisted by a Store class, which takes care of initializing the database at startup in its init() method and cleaning up resource handles in its destroy() method. It provides two methods: getValue(String) to get an object of type V from the BerkeleyDB database, and setValue(String, V) to save an object V into the database keyed by the String.

// Store.java
package com.mycompany.bdb;

import java.io.File;

import org.apache.commons.io.FileUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.persist.EntityStore;
import com.sleepycat.persist.PrimaryIndex;
import com.sleepycat.persist.StoreConfig;

public class Store<V> {

  private final Log log = LogFactory.getLog(getClass());
  
  private String dataDirectory;
  
  private Environment env;
  private EntityStore store;
  
  public void setDataDirectory(String dataDirectory) {
    this.dataDirectory = dataDirectory;
  }
  
  public void init() throws Exception {
    File dataDir = new File(dataDirectory);
    if (! dataDir.exists()) {
      FileUtils.forceMkdir(dataDir);
    }
    EnvironmentConfig environmentConfig = new EnvironmentConfig();
    environmentConfig.setAllowCreate(true);
    environmentConfig.setTransactional(true);
    env = new Environment(dataDir, environmentConfig);
    StoreConfig storeConfig = new StoreConfig();
    storeConfig.setAllowCreate(true);
    storeConfig.setTransactional(true);
    store = new EntityStore(env, dataDir.getName(), storeConfig);
  }
  
  public void destroy() throws Exception {
    if (store != null) {
      store.close();
    }
    if (env != null) {
      env.close();
    }
  }
  
  @SuppressWarnings("unchecked")
  public V getValue(String key) throws Exception {
    Class<?> entityClass = StoreEntity.class;
    PrimaryIndex<String,StoreEntity<V>> primaryIndex = 
      (PrimaryIndex<String,StoreEntity<V>>) store.getPrimaryIndex(
      key.getClass(), entityClass);
    StoreEntity<V> entity = primaryIndex.get(key);
    // guard against keys that have never been stored
    return (entity == null) ? null : entity.getValue();
  }
  
  @SuppressWarnings("unchecked")
  public void setValue(String key, V value) throws Exception {
    StoreEntity<V> entity = new StoreEntity<V>();
    entity.setKey(key);
    entity.setValue(value);
    PrimaryIndex<String,StoreEntity<V>> primaryIndex = 
      (PrimaryIndex<String,StoreEntity<V>>) store.getPrimaryIndex(
      key.getClass(), entity.getClass());
    primaryIndex.put(entity);
  }
}

To use this, the client code looks something like the snippet below. Obviously, real client code would be better structured, probably pulling the init() and destroy() calls out into its own init() and destroy() lifecycle methods rather than lumping everything together as shown, but you get the idea.

public class ClientCode {
  ...
  public void sampleCode() throws Exception {
    // initialize the store
    Store<List<String>> store = new Store<List<String>>();
    store.setDataDirectory(MY_BDB_DATA_DIR);
    store.init();
    // save something into the store
    String id = "some_id";
    List<String> values = new ArrayList<String>();
    values.add("value_1");
    values.add("value_2");
    store.setValue(id, values);
    ...
    // retrieve the value from the store
    List<String> retrievedValues = store.getValue(id);
    ...
    // clean up
    store.destroy();
  }
  ...
}

If you have read my earlier post referenced above, you will notice that the code here is virtually identical to the code there. The only difference is the use of generics to make the code reusable regardless of the payload to be persisted, without having to repeat all the boilerplate needed to initialize the BerkeleyDB store.

My next step was to try to make it configurable using Spring, which is where I ran into issues. I wanted the client to be able to configure multiple such stores, each servicing a particular data type (Java objects, custom objects, or collections of either), by specifying the class name of V and the name of the subdirectory where the data should be persisted. Passing in the class name of V was an idea I got from the IBM developerWorks article "Don't Repeat your DAO".

However, I could not find an easy way to build a Store<Whatever> object using the Class.forName() mechanism, where Whatever could either be a simple Java object, such as String or Integer, or a custom Java object, or a Collection of Java objects or custom objects. Gafter's Gadget looked kind of promising, but wasn't exactly what I was looking for.

From what I have read in other posts on the subject, what I am trying to do is probably impossible in Java at the moment: because generic type parameters are erased at compile time, a Class.forName() style call can only produce a raw Class, never a Store<Whatever> with the type parameter filled in. Generics give you flexibility at compile time, while Class.forName() gives you the same flexibility at run time; apparently, you can't have your cake and eat it too.

Of course, I could just implement the factory in code, with a Map of store names and corresponding Store instances set up at application startup (sketched below). However, I would rather not do that if I can help it. If anyone knows of a good way to do this, or of resources that might help, I would appreciate you pointing me at them.
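For completeness, here is a minimal sketch of that fallback, assuming the stores are registered in code at startup; the store name, data directory, and payload type below are made-up examples:

// StoreRegistry.java (sketch of the in-code factory described above)
package com.mycompany.bdb;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StoreRegistry {

  private final Map<String,Store<?>> stores = new HashMap<String,Store<?>>();

  // lifecycle method, called once at application startup
  public void init() throws Exception {
    Store<List<String>> valueListStore = new Store<List<String>>();
    valueListStore.setDataDirectory("/tmp/bdb/valuelists"); // made-up directory
    valueListStore.init();
    stores.put("valueLists", valueListStore);
  }

  // the caller must know the payload type; the cast cannot be checked
  @SuppressWarnings("unchecked")
  public <V> Store<V> getStore(String name) {
    return (Store<V>) stores.get(name);
  }

  // lifecycle method, called once at application shutdown
  public void destroy() throws Exception {
    for (Store<?> store : stores.values()) {
      store.destroy();
    }
  }
}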

Saturday, February 09, 2008

Debugging and Profiling with Eclipse

This post contains some settings I use for remote debugging web applications in the Jetty and Tomcat containers, and for profiling web applications deployed on a remote Tomcat server, using the Eclipse IDE. By remote I mean connecting over a socket; the container can (and in my case does, unless I am connecting from home) listen on a port on the local host. The material here is hardly original; it has been gleaned from various web pages and blogs, which I reference in the appropriate places. If you use (or are considering using) Eclipse and want to know how to do remote debugging and profiling, this information may be of some use to you.

Debugging

I have been using the Eclipse IDE (with the MyEclipse extension) for about 3 years now. Most of the time, when debugging, I just use logger.debug() calls within the code to see what's going on. I do know how to debug using the Eclipse Debug perspective, but I guess it's just a habit I developed, and old habits die hard. I don't even use Eclipse's CVS perspective anymore, based on some bad experiences at a previous company where I ended up inadvertently removing code from CVS that I had only meant to remove locally in my IDE (it was incorrect usage on my part). However, lately I am starting to find debugging very useful, mainly because of the long stop-deploy-start cycle for our main web application.

Unlike a lot of IDE users, I like to run my web container from the command line rather than from the IDE, for two reasons. First, I think the primary goal should be the ability to build a WAR file using Ant (or Maven) and deploy it to a container. A lot of IDEs make you go through various hoops to make the webapp "compliant", where the definition of what constitutes compliance can vary from IDE to IDE. As an Eclipse user, I have been a minority at my last two jobs, where the majority of Java developers use IDEA, so it usually turns out that I have to make Eclipse comply with what IDEA thinks is a webapp. Second, having to stop and restart the app within a container running inside your IDE involves using your mouse (or, in the case of a laptop, your touchpad), which is way less convenient than the command line with command history enabled.

We run and develop our main web application using Tomcat. I have been building Maven apps for quite a while now, and I tend to use the Maven-Jetty plugin because it's so much more convenient, so for Maven webapps I do most of my development using Jetty and then deploy to the Tomcat server. The upshot is that I need to be able to debug using remote Tomcat and Jetty instances.

Remote Debugging with Tomcat

The information here is from the Tomcat FAQ Wiki. Basically, you add the following to the $CATALINA_HOME/bin/setenv.sh file. My CATALINA_HOME is at /opt/apache-tomcat-5.5.25. If you already have a JAVA_OPTS defined for application-specific stuff, just add the settings below to your JAVA_OPTS.

# /opt/apache-tomcat-5.5.25/bin/setenv.sh
export JAVA_OPTS="-Xdebug \
  -Xrunjdwp:transport=dt_socket,address=8787,server=y,suspend=n"

The address=8787 setting enables a debug listener on Tomcat that Eclipse can connect to for debug information. On the Eclipse side, open the Debug launch configuration dialog by clicking "Run > Open Debug Dialog". In the left pane of the dialog, find "Remote Java Application", select it, and right-click (or click the New icon at the top). This opens a dialog for setting the parameters of a debug launch configuration. Here are my values:

| Tab name | Property name | Property value | Description |
|----------|---------------|----------------|-------------|
| - | Name | Tomcat (Pluto:8080) | Can be any name you want to give it. Mine says what and where. |
| Connect | Project | hl-www | This is your project name. |
| Connect | Connection Type | Standard - Socket Attach | Connect over a socket. |
| Connect | Connection Properties: Host | pluto.healthline.com | DNS name of the host; could be an IP address (I think). |
| Connect | Connection Properties: Port | 8787 | Same port as specified in address above. |
| Connect | Allow termination of remote VM | No | This is really your choice; I just don't want it. |
| Source | Source Lookup Path | Select your project | So you can see the sources as you debug. |
| Source | Source Lookup Path | Select any other source jars you have | So you can see the sources as you debug. |
| Common | Display in Favorites Menu | Yes | Adds the config as a bookmark under the debug icon. |

Deploy your app to the Tomcat container and restart Tomcat. In Eclipse, switch to the Debug perspective and set a breakpoint in your code (say, in a controller you want to call); [Alt]-[Shift]-B sets (or unsets) a breakpoint at the current line. Open up a browser and point it at the page you want to debug. Bringing the page up will activate the debugger in Eclipse: the line where you set the breakpoint is highlighted, and the top right corner shows the variables to be inspected. You can use the [F6] through [F8] keys to step over, into, and out of the code. You probably know how to take it from here.

Remote Debugging with the Maven-Jetty plugin

Information for this comes from Dan Allen's blog post Remote Debugging with Jetty. Unlike with Tomcat, this time you have to set the debugging parameters in MAVEN_OPTS, since Maven runs its classworlds Launcher rather than invoking your application's Java process directly. MAVEN_OPTS needs to be set in your configuration (either in your ~/.bash_profile or in a shell script that calls the mvn jetty6:run command). As before, if you already have other settings in your MAVEN_OPTS, the settings below need to go after them.

export MAVEN_OPTS="-Xdebug -Xnoagent -Djava.compiler=NONE \
  -Xrunjdwp:transport=dt_socket,address=8781,server=y,suspend=n"

You also need to disable the Jetty maxIdleTime interval by setting it to 0. This is done in the pom.xml file like so:

<project ...>
  ...
  <build>
    <plugins>
      <plugin>
        <groupId>org.mortbay.jetty</groupId>
        <artifactId>maven-jetty6-plugin</artifactId>
        <configuration>
          <scanIntervalSeconds>10</scanIntervalSeconds>
          <connectors>
            <connector implementation="org.mortbay.jetty.nio.BlockingChannelConnector">
              <port>8081</port>
              <maxIdleTime>0</maxIdleTime>
            </connector>
          </connectors>
        </configuration>
        <dependencies>
          <dependency>
            <groupId>org.apache.geronimo.specs</groupId>
            <artifactId>geronimo-j2ee_1.4_spec</artifactId>
            <version>1.0</version>
            <scope>provided</scope>
          </dependency>
        </dependencies>
      </plugin>
    </plugins>
  </build>
</project>

On the Eclipse side, the setup is identical to the Tomcat setup described above. Simply change the name (mine is called Jetty (Pluto:8081)) and the port number of the listener to what you set it to in MAVEN_OPTS (mine is 8781).

Profiling

Recently, I needed to profile a web application I wrote. It was taking 4-8 seconds to serve a single page on a production-class machine, against an expectation of about 40-80 milliseconds. Response times on my much less powerful development box, while not 40-80ms, were tolerable. My initial reaction was to put StopWatch calls within the handleRequest() method of the Controller, timing the blocks I thought could do with improvement. That caught some places where more time was being spent than I thought it should, so I fixed those, but the pages were still dog slow in production. Moreover, response times seemed to degrade under load, and load on the database machines was spiking enough to make them almost unusable. What I needed was a profiler, but I did not know how to set one up, much less run it and interpret the results.

However, good things sometimes happen to bad programmers, and our local performance guru was kind enough to set up a profiling instance on his Netbeans IDE (he is an IDEA user, but he uses Netbeans for its awesome profiling tool) and run a profile for me. It did identify several more hotspots in the code that could be optimized, and I fixed them. The performance did improve somewhat as a result, but we were still seeing spikes on the database machines.

The problem turned out to be contention for the same database resource with another web application, which I figured out by just thinking through it and looking through the code. However, the profiler output helped me weed out the unnecessary stuff quickly. So although the best way to find performance problems is still, in my opinion, just trolling through code coupled with an understanding of the program flow, a profiler makes the process much faster, because it has already told you what you are not looking for.

While I now know (thanks to the same guy who helped me out with the performance numbers before) how to do profiling with the Netbeans IDE, I wanted to do this from within Eclipse using the TPTP plugin, so what follows is my setup for doing that.

Remote Profiling Tomcat apps

Information for this comes from this Java profiling blog post, which links to an Eclipse-TPTP setup howto for Windows XP that I adapted for my use. TPTP needs a client component installed in the Eclipse IDE (the TPTP plugin) and an agent component, RAServer, which mediates between the performance data from the Tomcat server and the Eclipse TPTP client. Huge amounts of profiling data are transferred as XML documents, so using this from a truly remote (not localhost) client is very slow. Three things need to be set up to use TPTP to profile remote apps under Eclipse.

First, we need to download the TPTP plugin. If you are using a recent version of Eclipse (I am using 3.3.1.1) then you can get the plugin from the Europa Discovery Site. Simply click on "Help > Software Updates > Find and Install > Search for new features to install", then select the Performance and Monitoring features and click on "Select Required". This will download the TPTP plugin to your IDE. Restart your IDE to see the Profile icon on the toolbar, and "Run > Profile..." entries in your menu. The complete procedure is explained in detail in the Installing TPTP using Update Manager page.

Second, we need to install the agent component. This is available as a separate download for the particular architecture and operating system from the TPTP home page (scroll down to Agent Controller). Here is a link to the one I used.

Setting this up was mostly, but not entirely, straightforward. The first step is to unzip the download into /opt/tptpdc-4.1.0, then set up the following environment variables in your ~/.bash_profile and source it. Here is the snippet from my ~/.bash_profile:

# TPTP settings
export RASERVER_HOME=/opt/tptpdc-4.1.0
export PATH=$RASERVER_HOME/bin:$PATH
export LD_LIBRARY_PATH=$RASERVER_HOME/lib:$LD_LIBRARY_PATH

We then need to navigate to $RASERVER_HOME/bin and run SetConfig.sh (the very first time only) to set up the XML configuration RAServer needs. Then, from the same directory, we start the server using RAStart.sh (the corresponding stop script is RAStop.sh). However, when I ran the RAStart.sh script, I discovered that there were missing libraries on my Fedora Core 7 system. To fix that, I had to download the libstdc++ compatibility RPM from the RPMFind page and install it with the following command:

$ sudo rpm -ivh compat-libstdc++-296-2.96-138.i386.rpm

Finally, we need to set the JAVA_OPTS environment variable in the $CATALINA_HOME/bin/setenv.sh file, like so. Since we are starting Tomcat with profiling instrumentation enabled, it would complain about missing libraries; the complaints went away after I added the RASERVER_HOME paths to PATH and LD_LIBRARY_PATH in setenv.sh as well.

# /opt/apache-tomcat-5.5.25/bin/setenv.sh
export RASERVER_HOME=/opt/tptpdc-4.1.0
export PATH=$RASERVER_HOME/bin:$PATH
export LD_LIBRARY_PATH=$RASERVER_HOME/lib:$LD_LIBRARY_PATH
export JAVA_OPTS="-XrunpiAgent:server=enabled"

To start using profiling, I deployed the web application to Tomcat, started RAServer, then started Tomcat.

On the Eclipse side, I built a profiling launch configuration by clicking "Run > Profile", then right-clicking "Attach to Agent" in the left pane of the resulting dialog and choosing New. Here are the settings for my IDE.

| Tab name | Property name | Property value | Description |
|----------|---------------|----------------|-------------|
| - | Name | WWW (Pluto:8080) | Can be anything. Mine says what and where. |
| Hosts | Default Hosts | Added pluto.healthline.com:10002 | localhost:10002 was already there and could not be removed. Adding the new entry and selecting it makes it the current host. |
| Agents | Available Agents | Click Refresh to get the standard agent exposed by RAServer, and select it. | - |
| Destination | Profiling Project | I just chose the same project name I was monitoring. | - |
| Destination | Monitor | Choose Default Monitor (the default). | - |
| Common | Display in Favorites Menu | Yes | Makes the configuration appear when the Profile icon is clicked. |

Once this is done, switch to the Profiling perspective. If the agent has been discovered, Eclipse will attach to it and start collecting statistics. Since a web app's job is to serve pages, what I do is aim a URL-generating script at the application. Here is an example of a Python script that reads a list of URLs from a text file and hits the app with each URL.

#!/usr/bin/python
# Simple harness to run the URLs from the systemtesturls.txt manually
import sys
import string
import httplib
import time

def usage():
  print "Usage:" + sys.argv[0] + " www.myhost.com:80 /path/to/urllist"
  sys.exit(-1)

def main():
  if (len(sys.argv) != 3):
    usage()
  host = sys.argv[1]
  urllist = open(sys.argv[2], 'r')
  totaltime = 0
  maxtime = 0
  mintime = sys.maxint  # large sentinel, so the first measured time becomes the minimum
  lno = 0
  okresults = 0
  badresults = 0
  while 1:
    urlline = urllist.readline()
    if (not urlline):
      break
    if (urlline.startswith("#")):
      continue
    lno = lno + 1
    testurl = string.rstrip(urlline)
    print "Testing (" + str(lno) + "): " + testurl
    start = time.time()  # wall-clock time; time.clock() measures CPU time on Unix
    conn = httplib.HTTPConnection(host)
    conn.request("GET", testurl)
    resp = conn.getresponse()
    status = resp.status
    if (status == 200):
      okresults = okresults + 1
    else:
      badresults = badresults + 1
      print "Error:", status, resp.reason, str(lno)
    data = resp.read()
    conn.close()
    stop = time.time()
    elapsed = stop - start
    if (elapsed < mintime):
      mintime = elapsed
    if (elapsed > maxtime):
      maxtime = elapsed
    totaltime = totaltime + elapsed
  urllist.close()
  print "quality results, Ok=" + str(okresults) + ", Bad=" + str(badresults) + ", Total=" + str(lno)
  print "timing results: min(s)=" + str(mintime) + ", max(s)=" + str(maxtime) + ", avg(s)=" + str((totaltime / lno))

if __name__ == "__main__":
  main()

Once the script completes, you can stop the profiling. I was able to generate three reports from it: Execution Statistics, Memory Statistics, and Coverage Statistics. Of these, I found the Execution Statistics the most useful, since they told me how many times each method was called and the average processing time spent in it. Undoubtedly I will find more use for the other reports in the future, but for the moment I am happy to have profiling working under Eclipse.

Update Feb 16 2008

Recently I was able to profile using Maven's Jetty plugin as well. Instead of adding "-XrunpiAgent:server=enabled" to JAVA_OPTS, we add it to MAVEN_OPTS, then run mvn -o jetty6:run. The RASERVER_HOME, LD_LIBRARY_PATH and PATH settings also need to be in there for the agent to work correctly. My new improved jetty.sh now looks like this:

#!/bin/bash
BASE_MAVEN_OPTS="-Xmx2048m"
DEBUG_MAVEN_OPTS="-Xdebug -Xnoagent -Djava.compiler=NONE -Xrunjdwp:transport=dt_socket,address=8781,server=y,suspend=n"
PROFILE_MAVEN_OPTS="-XrunpiAgent:server=enabled"
case $1 in
  'debug')
    MAVEN_OPTS=$BASE_MAVEN_OPTS" "$DEBUG_MAVEN_OPTS
    ;;
  'profile')
    export RASERVER_HOME=/opt/tptpdc-4.1.0
    export PATH=$RASERVER_HOME/bin:$PATH
    export LD_LIBRARY_PATH=$RASERVER_HOME/lib:$LD_LIBRARY_PATH
    MAVEN_OPTS=$BASE_MAVEN_OPTS" "$PROFILE_MAVEN_OPTS
    ;;
  *)
    MAVEN_OPTS=$BASE_MAVEN_OPTS
    ;;
esac
export MAVEN_OPTS
mvn -o jetty6:run

To start a normal session, I just call jetty.sh; for debugging and profiling, I call jetty.sh debug and jetty.sh profile respectively. On the Eclipse side, I create a profile configuration the same way as for Tomcat, attaching the profiling client to the running Java application. RAServer detects the Java app exposing profiling information and discovers it automatically.

Update Feb 27 2008

This post was republished by the folks at SYS-CON Media in their Open Web Developer's Journal and is available here. Goes to show that one should be careful about what one writes; it may end up anywhere :-). Thanks to Jeremy Geelan for making this happen.

Saturday, February 02, 2008

Spatial Search with Lucene

In his book "Object-Relational DBMSs - The Next Great Wave", Dr Michael Stonebraker wrote about possible extensions to commercial RDBMSs that would allow SQL queries of the form:

create table mytable (
  info varchar(255) not null,
  coord Point not null
);
insert into mytable(info, coord) values ('foo', Point(-100,35));
...
select info from mytable
where distance(coord, Point(-110,37)) < 10;

Here Point(x,y) is a user-defined data type that represents a point in 2D space. Since then, many open-source and commercial databases, such as PostgreSQL, MySQL, and Oracle, have implemented spatial extensions. One great use case for these is geographic search, where a user enters a location in the form of an address, which is looked up in something like the TIGER/Line database from the US Census Bureau and then used as the origin when searching for certain types of businesses "close to" it.

Recently, on what is possibly my n-th reading of Erik Hatcher's "Lucene in Action" book, I came across an example of doing this with Lucene. I thought I'd try to build some code to do this as per the recommendations in the book, which is what this post is all about.

First, I build the input and output beans. The GeoPoint bean represents a point on the earth and is instantiated with longitude and latitude values. It has methods for normalizing these values for ease of searching with a Lucene RangeQuery, and a method to calculate the distance between two points using Pythagoras' theorem. The GeoResult bean represents the object that the search results populate.

// GeoPoint.java
package com.mycompany.geosearch;

import org.apache.commons.lang.StringUtils;

/**
 * Simple bean to represent a single point on the earth.
 */
public class GeoPoint {

  private double longitude;
  private double latitude;

  public GeoPoint(double longitude, double latitude) {
    setLongitude(longitude);
    setLatitude(latitude);
  }
  
  public double getLongitude() {
    return longitude;
  }
  
  public void setLongitude(double longitude) {
    this.longitude = longitude;
  }
  
  public double getLatitude() {
    return latitude;
  }
  
  public void setLatitude(double latitude) {
    this.latitude = latitude;
  }

  public String getNormalizedLongitude() {
    return normalize(getLongitude(), 180);
  }

  public String getNormalizedLatitude() {
    return normalize(getLatitude(), 90);
  }
  
  private String normalize(double coord, int offset) {
    Double d = coord + offset;
    String s = String.valueOf(d);
    String[] parts = StringUtils.split(s, ".");
    if (parts[1].length() > 6) {
      parts[1] = parts[1].substring(0, 6);
    }
    return StringUtils.leftPad(parts[0], 3, "0") + 
      StringUtils.rightPad(parts[1], 6, "0");
  }
  
  public double distanceFrom(GeoPoint anotherPoint) {
    double distX = Math.abs(anotherPoint.getLongitude() - this.getLongitude());
    double distY = Math.abs(anotherPoint.getLatitude() - this.getLatitude());
    return Math.sqrt((distX * distX) + (distY * distY));
  }
}

// GeoResult.java
package com.mycompany.geosearch;

import org.apache.commons.lang.builder.ReflectionToStringBuilder;
import org.apache.commons.lang.builder.ToStringStyle;

/**
 * Bean to represent results of a GeoSearch.
 */
public class GeoResult {

  private String name;
  private String address;
  private String phone;
  private String category;
  private double distanceKmFromOrigin;
  private double latitude;
  private double longitude;
  
  public String getName() {
    return name;
  }
  
  public void setName(String name) {
    this.name = name;
  }
  
  public String getAddress() {
    return address;
  }
  
  public void setAddress(String address) {
    this.address = address;
  }
  
  public String getPhone() {
    return phone;
  }

  public void setPhone(String phone) {
    this.phone = phone;
  }

  public String getCategory() {
    return category;
  }

  public void setCategory(String category) {
    this.category = category;
  }

  public double getDistanceKmFromOrigin() {
    return distanceKmFromOrigin;
  }

  public void setDistanceKmFromOrigin(double distanceKmFromOrigin) {
    this.distanceKmFromOrigin = distanceKmFromOrigin;
  }

  public double getLatitude() {
    return latitude;
  }
  
  public void setLatitude(double latitude) {
    this.latitude = latitude;
  }
  
  public double getLongitude() {
    return longitude;
  }
  
  public void setLongitude(double longitude) {
    this.longitude = longitude;
  }

  public String toString() {
    return ReflectionToStringBuilder.reflectionToString(
      this, ToStringStyle.NO_FIELD_NAMES_STYLE);
  }
}

Next comes the GeoSearcher, which takes a GeoPoint object and the distance within which to search. It builds a BooleanQuery consisting of two ConstantScoreRangeQuery objects (an improved RangeQuery, new in Lucene 2.2), one for the latitude range to search and one for the longitude range. The GeoPoint.normalize() method adds 90 to latitude values (so the South Pole is at normalized latitude 0 instead of -90, and the Equator at normalized latitude 90 instead of 0) and 180 to longitude values (so the International Date Line is at normalized longitude 0 instead of -180, and Big Ben at roughly normalized longitude 180 instead of 0). It also pads the front and rear of the number with zeros so that the RangeQuery's lexicographic comparison works correctly.
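As a quick sanity check of the normalization, here is a tiny sketch using the GeoPoint class above; the coordinates are made up, and chosen to be exactly representable in binary so the padded output is predictable:

// NormalizeDemo.java (illustration only)
package com.mycompany.geosearch;

public class NormalizeDemo {
  public static void main(String[] args) {
    // longitude -122.5, latitude 37.75
    GeoPoint point = new GeoPoint(-122.5D, 37.75D);
    System.out.println(point.getNormalizedLongitude()); // prints "057500000"
    System.out.println(point.getNormalizedLatitude());  // prints "127750000"
  }
}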

However, the results returned fall within a square, so we calculate the distance along the hypotenuse using the GeoPoint.distanceFrom() method to make sure we only keep results within the circular area covered by points within the specified distance of the origin.

There is one major (incorrect) assumption here: that the earth is flat (a view popular among the scientific community in the 15th and 16th centuries), so the distance between two longitude values is treated as the same regardless of the latitude of the origin. Obviously, the further you get from the equator, the closer neighboring lines of longitude get, until at the poles there is no distance between them at all. I could not find the math to calculate this, and I was too lazy to work it out, so I just put in a placeholder method, calculateKilometersPerLongitudeDegree(), which returns the same constant value as the kilometers per degree of latitude.
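For reference, the standard spherical-earth approximation simply scales the per-degree distance by the cosine of the latitude; a minimal sketch (not the original code) of what the placeholder could do under that assumption:

  // Spherical-earth approximation (assumption, not in the original code):
  // a degree of longitude spans KILOMETERS_PER_DEGREE * cos(latitude) km.
  private double calculateKilometersPerLongitudeDegree(double latitude) {
    return KILOMETERS_PER_DEGREE * Math.cos(Math.toRadians(latitude));
  }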

// GeoSearcher.java
package com.mycompany.geosearch;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.ConstantScoreRangeQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopFieldDocs;
import org.apache.lucene.search.BooleanClause.Occur;

/**
 * Searcher to return documents whose latitude and longitude fall within
 * the specified distance (in kilometers) of the origin.
 */
public class GeoSearcher {

  private static final String LATITUDE_FIELD_NAME = "normlat";
  private static final String LONGITUDE_FIELD_NAME = "normlon";
  private static final String FILTER_FIELD_NAME = "category";
  private static final int MAX_RESULTS = 10;
  
  private static final double KILOMETERS_PER_DEGREE = 111.3171;
  
  private final Log log = LogFactory.getLog(getClass());
  
  private IndexSearcher geoIndexSearcher;
  
  public GeoSearcher(String indexDir) throws IOException {
    this.geoIndexSearcher = new IndexSearcher(indexDir);
  }

  public List<GeoResult> naiveSearch(final GeoPoint origin, int distanceKms, 
      String categoryFilter) throws IOException {
    List<GeoResult> results = new ArrayList<GeoResult>();
    Query query = buildQuery(origin, distanceKms);
    // category is filtered using cached query filters. Since categories are
    // going to be a finite set of values in a given application, it makes 
    // sense to have them as query filters, since they are cached.
    CachingWrapperFilter queryFilter = null;
    if (StringUtils.isNotEmpty(categoryFilter)) {
      queryFilter = new CachingWrapperFilter(new QueryWrapperFilter(new TermQuery(
        new Term(FILTER_FIELD_NAME, categoryFilter))));
    }
    Hits hits = geoIndexSearcher.search(query, queryFilter);
    int numHits = hits.length();
    for (int i = 0; i < numHits; i++) {
      Document doc = hits.doc(i);
      GeoPoint point = new GeoPoint(
        Double.valueOf(doc.get("lon")), Double.valueOf(doc.get("lat")));
      // distanceFrom() returns degrees, so multiply to convert to kilometers
      double distanceKmFromOrigin = point.distanceFrom(origin) * KILOMETERS_PER_DEGREE;
      if (distanceKmFromOrigin > distanceKms) {
        // enforce that all results within a circular area
        continue;
      }
      results.add(buildGeoResultFromDocument(doc, point, distanceKmFromOrigin));
    }
    // sort by distance, closest result to origin first
    Collections.sort(results, new Comparator<GeoResult>() {
      public int compare(GeoResult result1, GeoResult result2) {
        double distance1 = result1.getDistanceKmFromOrigin();
        double distance2 = result2.getDistanceKmFromOrigin();
        if (distance1 == distance2) {
          return 0;
        } else if (distance1 < distance2) {
          return -1;
        } else {
          return 1;
        }
      }
    });
    return results;
  }
  
  public List<GeoResult> recommendedSearch(final GeoPoint origin, int distanceKms, 
      String categoryFilter) throws IOException {
    List<GeoResult> results = new ArrayList<GeoResult>();
    Sort sort = new Sort(new SortField(LATITUDE_FIELD_NAME, new GeoSortComparatorSource(origin)));
    Query query = buildQuery(origin, distanceKms);
    // apply the category filter here too, cached the same way as in naiveSearch()
    CachingWrapperFilter queryFilter = null;
    if (StringUtils.isNotEmpty(categoryFilter)) {
      queryFilter = new CachingWrapperFilter(new QueryWrapperFilter(new TermQuery(
        new Term(FILTER_FIELD_NAME, categoryFilter))));
    }
    TopFieldDocs topFieldDocs = geoIndexSearcher.search(query, queryFilter, MAX_RESULTS, sort);
    // we just ask for the top MAX_RESULTS, so limit it
    int totalHits = Math.min(topFieldDocs.totalHits, MAX_RESULTS);
    for (int i = 0; i < totalHits; i++) {
      Document doc = geoIndexSearcher.doc(topFieldDocs.scoreDocs[i].doc);
      GeoPoint point = new GeoPoint(
          Double.valueOf(doc.get("lon")), Double.valueOf(doc.get("lat")));
      // distanceFrom() returns degrees, so multiply to convert to kilometers
      double distanceKmFromOrigin = point.distanceFrom(origin) * KILOMETERS_PER_DEGREE;
      if (distanceKmFromOrigin > distanceKms) {
        // enforce that all results within a circular area
        continue;
      }
      results.add(buildGeoResultFromDocument(doc, point, distanceKmFromOrigin));
    }
    return results;
  }

  /**
   * Method to close the searcher from client code.
   * @exception IOException if one is thrown.
   */
  public void close() throws IOException {
    geoIndexSearcher.close();
  }

  /**
   * Build a Range Query from the origin and the distance in kilometers to search
   * within. The RangeQuery will return all documents that are in a square area
   * around the origin.
   * @param origin the GeoPoint object corresponding to the origin.
   * @param distanceKms the distance in kilometers on each side of the origin to search.
   * @return a BooleanQuery containing two RangeQueries.
   * @throws IOException if one is thrown.
   */
  private Query buildQuery(GeoPoint origin, int distanceKms) throws IOException {
    double spreadOnLongitude = 
      distanceKms / calculateKilometersPerLongitudeDegree(origin.getLatitude());
    double spreadOnLatitude = distanceKms / KILOMETERS_PER_DEGREE;
    GeoPoint topLeft = new GeoPoint(origin.getLongitude() - spreadOnLongitude, 
      origin.getLatitude() - spreadOnLatitude);
    GeoPoint bottomRight = new GeoPoint(origin.getLongitude() + spreadOnLongitude, 
      origin.getLatitude() + spreadOnLatitude);
    BooleanQuery query = new BooleanQuery();
    ConstantScoreRangeQuery latitudeQuery = new ConstantScoreRangeQuery(
      LATITUDE_FIELD_NAME,
      topLeft.getNormalizedLatitude(),
      bottomRight.getNormalizedLatitude(),
      true, true);
    query.add(new BooleanClause(latitudeQuery, Occur.MUST));
    ConstantScoreRangeQuery longitudeQuery = new ConstantScoreRangeQuery(
      LONGITUDE_FIELD_NAME,
      topLeft.getNormalizedLongitude(),
      bottomRight.getNormalizedLongitude(),
      true, true);
    query.add(new BooleanClause(longitudeQuery, Occur.MUST));
    log.debug("query:" + query.toString());
    return query;
  }

  /**
   * The kilometers per longitude degree will decrease as we move up from
   * the equator to the poles, but for simplicity (and until I figure out
   * the calculation for this, we just return the same value as the 
   * predefined KILOMETERS_PER_DEGREE (which is the kilometers per degree
   * of latitude).
   * @param latitude the original latitude.
   * @return the kilometers per degree between longitudes at that latitude.
   */
  private double calculateKilometersPerLongitudeDegree(double latitude) {
    return KILOMETERS_PER_DEGREE;
  }

  /**
   * Convenience method to build a GeoResult object from a Lucene document.
   * @param doc the Lucene document object.
   * @param point the GeoPoint object for this result.
   * @param distanceKmFromOrigin the calculated distance from the origin.
   * @return a populated GeoResult object.
   */
  private GeoResult buildGeoResultFromDocument(Document doc, GeoPoint point, 
      Double distanceKmFromOrigin) {
    GeoResult result = new GeoResult();
    result.setName(doc.get("name"));
    result.setAddress(doc.get("address"));
    result.setPhone(doc.get("phone"));
    result.setCategory(doc.get("occupation"));
    result.setDistanceKmFromOrigin(distanceKmFromOrigin);
    result.setLatitude(point.getLatitude());
    result.setLongitude(point.getLongitude());
    return result;
  }
}

My first approach, which I call naiveSearch() above, is to simply build the Query and hit the index with it. I then iterate through the Hits returned, applying the distanceFrom() predicate to each result and discarding results that are not in the circle defined by the origin and radius. Since my use case forces the user to specify a category, I don't get too many results after applying the category filter (maybe in the region of 20-30), so I just use Java's Collections.sort() with a custom Comparator to return the points closest to the origin first.

The category filter is a regular Lucene QueryFilter object, which is cached lazily. Since I expect a finite number of categories, it makes sense to build these as filters applied alongside the BooleanQuery of range queries, rather than shoving the category term into the BooleanQuery itself, where it would be evaluated every time instead of being matched against a cached (after the first use) BitSet.

Why did I use Collections.sort() rather than a custom Lucene Sort object? Well, I have used SortComparatorSource implementations in the past and have had performance problems with them. In retrospect, I understand that I used them incorrectly, applying them to the entire result set returned from a Query instead of using the approach recommended in the Lucene book. Also, I was a little concerned about having to instantiate a Sort object for every query, since I cannot sort on distance without knowing the origin, which changes per query.

Anyway, the recommended approach seems to be to get a TopFieldDocs from the searcher, limiting the result count. The book says that's the only way to use a Sort object in the query; I am guessing that used to be true for version 1.4 (which is when I bought the book) but is no longer the case. The implementation in the book also pre-sorts all distances from a given location, so my implementation, which calculates distances per query, is unlikely to perform much better than the naive approach. Anyway, at least I got to work with the TopFieldDocs object.

// GeoSortComparatorSource.java
package com.mycompany.geosearch;

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.ScoreDocComparator;
import org.apache.lucene.search.SortComparatorSource;
import org.apache.lucene.search.SortField;

/**
 * Custom Sorting for Distance calculations.
 */
public class GeoSortComparatorSource implements SortComparatorSource {

  private static final long serialVersionUID = -4338638868770017111L;
  private final Log log = LogFactory.getLog(getClass());
  private GeoPoint origin;
  
  public GeoSortComparatorSource(GeoPoint origin) {
    this.origin = origin;
  }
  
  public ScoreDocComparator newComparator(final IndexReader reader, final String fieldname) 
      throws IOException {
    return new ScoreDocComparator() {
      public int compare(ScoreDoc i, ScoreDoc j) {
        try {
          Document doc1 = reader.document(i.doc);
          Document doc2 = reader.document(j.doc);
          GeoPoint point1 = new GeoPoint(
            Double.valueOf(doc1.get("lon")), Double.valueOf(doc1.get("lat")));
          GeoPoint point2 = new GeoPoint(
            Double.valueOf(doc2.get("lon")), Double.valueOf(doc2.get("lat")));
          if (point1.distanceFrom(origin) < point2.distanceFrom(origin)) {
            return -1;
          } else if (point1.distanceFrom(origin) > point2.distanceFrom(origin)) {
            return 1;
          } else {
            return 0;
          }
        } catch (Exception e) {
          log.error(e);
          return 0;
        }
      }

      public int sortType() {
        return SortField.DOUBLE;
      }

      public Comparable sortValue(ScoreDoc i) {
        try {
          Document doc = reader.document(i.doc);
          GeoPoint point = new GeoPoint(
              Double.valueOf(doc.get("lon")), Double.valueOf(doc.get("lat")));
          return new Double(point.distanceFrom(origin));
        } catch (Exception e) {
          log.error(e);
          return new Double(0D);
        }
      }
    };
  }
}

I tried both methods with 3 queries each, using the same origin point and hitting the index successively for 5, 10, and 20 kilometers with different category filters. Based on that, recommendedSearch() performs about the same as naiveSearch(), both clocking in at around 300-500ms on my laptop. If any of you see issues with these approaches, or can think of opportunities to tune the algorithm, I would appreciate hearing about it.