Saturday, September 29, 2007

Using Lucene with Jython

In a previous post, I described a workaround for using Lucene BooleanQueries with PyLucene. Basically, all it involved was building the query string programmatically, using the AND and OR boolean operators supplied by Lucene's Query Parser syntax, before passing it to the PyLucene.QueryParser object.
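As a rough illustration of that earlier workaround, the query string can be assembled with plain string operations before it is handed to the parser. This sketch is not the code from that post: the function name and the quoting rule are assumptions made for illustration, and it does not escape Lucene's special characters.

```python
def build_boolean_query(required_terms, optional_terms=()):
    """Join terms into a Lucene query-parser string using AND / OR.

    Hypothetical helper: required terms are ANDed together, optional
    terms are ORed inside a parenthesised group.
    """
    def quote(term):
        # Multi-word terms are wrapped in double quotes so the parser
        # treats them as a phrase.
        if " " in term:
            return '"%s"' % term
        return term

    clauses = [quote(t) for t in required_terms]
    if optional_terms:
        clauses.append("(" + " OR ".join([quote(t) for t in optional_terms]) + ")")
    return " AND ".join(clauses)
```

Calling build_boolean_query(["heart attack"], ["symptoms", "treatment"]) returns the string "heart attack" AND (symptoms OR treatment), which can then be passed to the query parser as-is.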

However, I faced a slightly different problem this time. My task was to quality check an index built using a custom Lucene Analyzer (written in Java). The base queries the user was expected to type into our search page were available as a flat file. The quality check involved converting each input query into a custom Lucene Query object, applying a set of standard facets to the Query using a QueryFilter, and writing the results of each IndexSearcher.search(Query,QueryFilter) call into another flat file.

Of course, the most logical solution would have been to write a Java JUnit test that did this. But this was kind of a one-off, and writing Java code seemed kind of wasteful. I had experimented with Jython once before, where I was looking for a way to call some Java standalone programs from the command line. So I decided to try the same approach of adding the JAR files I needed to Jython's sys.path.

So here is my code, which should be pretty much self-explanatory. The script takes as input arguments the path to the Lucene index, the path to the input file of query strings, and the path to the file where the report should be written.

#!/opt/jython2.2/jython
import sys
import string

def usage():
  print " ".join([sys.argv[0], "/path/to/index/to/read", "/path/to/input/file", \
    "/path/to/output/file"])
  sys.exit(-1)

def main():
  # Command line processing
  if (len(sys.argv) != 4):
    usage()

  # Set up constants for reporting
  facetValues = ["value1", "value2", "value3", "value4", "value5"]

  # Add jars to classpath
  jars = [
    "/full/path/to/lucene.jar",
    "/full/path/to/our/custom/analyzer.jar",
    # ... other dependency jars
    ]
  for jar in jars:
    sys.path.append(jar)

  # Import references
  from org.apache.lucene.index import Term
  from org.apache.lucene.queryParser import QueryParser
  from org.apache.lucene.search import IndexSearcher
  from org.apache.lucene.search import TermQuery
  from org.apache.lucene.search import QueryFilter
  from org.apache.lucene.store import FSDirectory
  from com.mycompany.analyzer import MyCustomAnalyzer

  # load up an array with the input query strings
  querystrings = []
  infile = open(sys.argv[2], 'r')
  outfile = open(sys.argv[3], 'w')
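  while (True):
    line = infile.readline()
    # readline() returns an empty string only at end of file
    if (line == ''):
      break
    line = line.strip()
    # skip blank lines instead of stopping at the first one
    if (line != ''):
      querystrings.append(line)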
  
  # search for the query and facet
  dir = FSDirectory.getDirectory(sys.argv[1], False)
  analyzer = MyCustomAnalyzer()
  searcher = IndexSearcher(dir)
  for querystring in querystrings:
    for facetValue in facetValues:
      luceneQuery = buildCustomLuceneQuery(querystring)
      query = QueryParser("body", analyzer).parse(luceneQuery)
      queryfilter = QueryFilter(TermQuery(Term("facet", facetValue)))
      hits = searcher.search(query, queryfilter)
      numHits = hits.length()
      # if we found nothing for this query and facet, we report it
      if (numHits == 0):
        outfile.write("|".join([querystring, facetValue, 'No Title', 'No URL', '0.0']) + "\n")
        continue
      # show up to the top 3 results for the query and facet
      for i in range(0, min(numHits, 3)):
        doc = hits.doc(i)
        score = hits.score(i)
        title = doc.get("title")
        url = doc.get("url")
        outfile.write("|".join([querystring, facetValue, title, url, str(score)]) + "\n")

  # clean up
  searcher.close()
  infile.close()
  outfile.close()

def buildCustomLuceneQuery(querystring):
  """ do some custom query building here """
  # placeholder: pass the query string through unchanged
  return querystring
  
if __name__ == "__main__":
  main()

Why is this so cool? As you can see, the Python code is quite simple. However, it allows me to access functionality embedded in our custom Lucene Analyzer written in Java, as well as access the newer features of Lucene 2.1 (PyLucene is based on Lucene 1.4) if I need them. So basically, I can now write what is essentially Java client code in the much more compact Python language. Also, if I had written a Java program, I would either have to call Java with a rather longish -classpath parameter, or build up a shell script or Ant target. With Jython, the script can be called directly from the command line.
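If the list of jars gets long, one option is to scan a directory instead of naming each file. This is only a sketch: the helper name and the idea of a single lib directory are hypothetical, and it assumes, as the script above does, that appending a jar to sys.path is enough for Jython to import classes from it.

```python
import glob
import os
import sys

def add_jars(lib_dir):
    """Append every .jar file found directly under lib_dir to sys.path.

    Returns the list of paths that were newly added.
    """
    jars = glob.glob(os.path.join(lib_dir, "*.jar"))
    jars.sort()  # keep the order deterministic
    added = []
    for jar in jars:
        # avoid duplicate entries if the helper is called twice
        if jar not in sys.path:
            sys.path.append(jar)
            added.append(jar)
    return added
```

The Java imports would still have to come after the call to add_jars(), just as they come after the sys.path loop in the script.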

There are some obvious downsides as well. Since I mostly use Python for scripting, I end up downloading and installing many custom modules for Python, that I don't necessarily install on my Jython installation. For example, for database access, I have modules installed for Oracle, MySQL and PostgreSQL. However, with Jython, we could probably just use JDBC for database access, as described in Andy Todd's blog post here. Overall, I think having access to Java code from within Python using Jython is quite useful.

Wednesday, September 19, 2007

SOAP Client for Amazon ECS with XFire

SOAP based Webservices are fairly ubiquitous nowadays, but so far, I had never had a need to build one. I have built several non-SOAP Webservices in the past, but they were all for internal use, so I used various light-weight remoting technologies such as Caucho's Burlap and Hessian, Spring's HttpInvoker, RMI and so on. All of these involve making the API JARs for the service available to the client somehow. SOAP is more like CORBA, in the sense that its WSDL file is similar to the CORBA IDL, and serves the same purpose. Given a WSDL, the client should be able to generate an API locally.

One of the nice things about being a developer is that, when given a tool or technology you are unfamiliar with, you can build some time into your project to learn it. As a manager, that luxury is denied to me. I have seen managers at past jobs deal with this by doing the unfamiliar work themselves, and handing off the rest to their engineers, but having been on the receiving end of this transaction, I did not want to perpetuate it. Besides, I was working on another equally high-priority project at the time and did not have the bandwidth to commit to a delivery date for this one. So I ended up assigning this work, which neither he nor I had done before, to one of our engineers.

Nevertheless, it made me uncomfortable that I did not know enough about what I was asking someone else to do. Besides, given the ubiquity of SOAP, being able to build a SOAP client should be part of the average Java developer's skillset. So I decided to do a small proof-of-concept to learn how to build a SOAP Webservice client. The service I chose to hit was Amazon's E-Commerce Service (Amazon ECS or AWS), a very comprehensive and well-built API that exposes almost every bit of information you can find on their website. You can see their WSDL file here.

For a toolkit, I first chose Axis2, but its wsdl2java tool fails with unsatisfied dependencies when I request adb or jibx data bindings, and hangs when I request XmlBeans bindings. For those unfamiliar with the term "data bindings" (as I was when I started last weekend), it's just the generated Java beans representing the types defined in the WSDL file, plus the parsing code to convert between the XML and Java. Colleagues have reported success generating APIs from WSDL files using IDEs such as Netbeans and IDEA, and I am sure Eclipse can do it too, but wsdl2java from Axis2 did not work for me on the command line.
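To make the term concrete, here is a toy illustration of what a data binding layer does, written in Python to keep it short. It is not what wsdl2java generates: the XML fragment shape, the Book class, and the element names are all invented for the example.

```python
import xml.etree.ElementTree as ET

class Book(object):
    """Plain data holder, standing in for a generated bean."""
    def __init__(self, asin, title, authors):
        self.asin = asin
        self.title = title
        self.authors = authors

def book_from_xml(xml_text):
    """Convert a (hypothetical) <Item> fragment into a Book object."""
    item = ET.fromstring(xml_text)
    return Book(
        asin=item.findtext("ASIN"),
        title=item.findtext("Title"),
        authors=[a.text for a in item.findall("Author")],
    )

def book_to_xml(book):
    """Convert a Book object back into the same XML shape."""
    item = ET.Element("Item")
    ET.SubElement(item, "ASIN").text = book.asin
    ET.SubElement(item, "Title").text = book.title
    for author in book.authors:
        ET.SubElement(item, "Author").text = author
    return ET.tostring(item, encoding="unicode")
```

A binding framework generates both halves (the classes and the conversion code) from the schema, so none of this has to be written by hand.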

I then chose Apache CXF, the successor to XFire (CXF came out of a merger of the XFire and Celtix projects). XFire seems to be the better-known name, which is why I used it in the title; in the rest of the post I will call it Apache-CXF. I had heard about Apache-CXF when checking out Spring remoting strategies. It is supposed to be lighter weight than Axis2, but I don't know enough about either project to elaborate on what that means. Anyway, its wsdl2java worked great, generating JAXB bindings for me by default. Here is the command I used to generate the API from the AWS WSDL file:

sujit@sirocco:~/tmp/apache-cxf-2.0.1-incubator/bin$ ./wsdl2java \
  -p net.sujit.amazon.generated \
  -client \
  -d /home/sujit/src/wsclient/src/main/java \
  http://webservices.amazon.com/AWSECommerceService/AWSECommerceService.wsdl

This generated a bunch of files in the net/sujit/amazon/generated subdirectory of my Maven2 application. At this point, I was ready to build my client. Amazon is my (and probably the world's) favorite bookstore, so I decided to build a client that would search for and return a list of books based on a search string. To keep my client separate from the generated code, I put it in a package parallel to the generated one. The client code follows:

package net.sujit.amazon.client;

import java.util.ArrayList;
import java.util.List;

import net.sujit.amazon.generated.AWSECommerceService;
import net.sujit.amazon.generated.AWSECommerceServicePortType;
import net.sujit.amazon.generated.Item;
import net.sujit.amazon.generated.ItemAttributes;
import net.sujit.amazon.generated.ItemSearch;
import net.sujit.amazon.generated.ItemSearchRequest;
import net.sujit.amazon.generated.ItemSearchResponse;
import net.sujit.amazon.generated.Items;

import org.apache.commons.lang.StringUtils;
import org.apache.log4j.Logger;

public class MyWebServicesClient {

  private static final Logger LOGGER = Logger.getLogger(MyWebServicesClient.class);
  private static final String ACCESS_KEY = "MySuperSecretHexKey";
  private static final String BOOK_SEARCHINDEX = "Books";
  
  private AWSECommerceServicePortType client;
  
  public void init() throws Exception {
    AWSECommerceService service = new AWSECommerceService();
    this.client = service.getAWSECommerceServicePort();
  }
  
  public List<MyBook> getSearchResults(String keywords) {
    List<MyBook> myBooks = new ArrayList<MyBook>();
    ItemSearch itemSearch = new ItemSearch();
    itemSearch.setAWSAccessKeyId(ACCESS_KEY);
    ItemSearchRequest request = new ItemSearchRequest();
    request.setKeywords(keywords);
    request.setCondition("All");
    request.setSearchIndex(BOOK_SEARCHINDEX);
    request.getResponseGroup().add("ItemAttributes");
    itemSearch.getRequest().add(request);
    ItemSearchResponse response = client.itemSearch(itemSearch);
    List<Items> itemsList = response.getItems();
    for (Items items : itemsList) {
      List<Item> itemList = items.getItem();
      for (Item item : itemList) {
        MyBook myBook = new MyBook();
        myBook.setAsin(item.getASIN());
        myBook.setUrl(item.getDetailPageURL());
        ItemAttributes attributes = item.getItemAttributes();
        myBook.setTitle(attributes.getTitle());
        myBook.setAuthor(StringUtils.join(attributes.getAuthor().iterator(), ", "));
        myBook.setPublisher(attributes.getPublisher());
        myBook.setPublicationDate(attributes.getPublicationDate());
        myBooks.add(myBook);
      }
    }
    return myBooks;
  }
}

As you can see, my application client instantiates the service using the service stub and gets a reference to the underlying client proxy. This is done in the init() method. Once you have that, you are golden, and the rest of the code is just application code calling the remote methods via the proxy.

The getSearchResults() method represents the actual application code. The method takes a search string as its argument. It instantiates an ItemSearch object with the AWS access key, builds an ItemSearchRequest with the search string, and adds the ItemSearchRequest to the ItemSearch object. Passing the ItemSearch object to the client proxy's itemSearch() method yields an ItemSearchResponse, which I then walk to pull the information from the web service into a List of MyBook view beans. The MyBook bean is a simple data holder, with the fields as shown below. The getters and setters have been omitted in the interest of space.

package net.sujit.amazon.client;

public class MyBook {
  private String asin;
  private String title;
  private String url;
  private String author;
  private String publisher;
  private String publicationDate;
  ...  
}

More information on the various methods available in AWS and the parameters that they take is available in the Amazon ECS Web Developer's guide. As mentioned before, the service is very comprehensive, so it's advisable to go through the guide if you want to do anything serious with it.

I use a JUnit test to actually call this method. This could also have been done using a main() method on the MyWebServicesClient.java file. The JUnit test looks like this:

package net.sujit.amazon.client;

import java.util.List;

import org.apache.log4j.Logger;
import org.junit.Before;
import org.junit.Test;

public class MyWebServicesClientTest {

  private static final Logger LOGGER = Logger.getLogger(MyWebServicesClientTest.class);
  
  private MyWebServicesClient client;
  
  @Before
  public void setUp() throws Exception {
    client = new MyWebServicesClient();
    client.init();
  }
  
  @Test
  public void testGetSearchResults() throws Exception {
    List<MyBook> books = client.getSearchResults("java web services");
    LOGGER.debug("#-books:" + books.size());
    for (MyBook book : books) {
      LOGGER.debug("--");
      LOGGER.debug(book.toString());
    }
  }
}

The search for "java web services" (see test above) returns the first 10 results. I did not actually build a UI for it, so I am not going to include the results here, but building a UI should be a very simple task.

As you can see, the actual code you have to write to build a SOAP client is minimal. In that sense, I can see why SOAP is so popular as an external Webservice framework, even though the XML itself is so horribly bloated compared to other XML based remoting protocols. Building a client should take very little time if you know what needs to be done before you start. Here is the list of steps I followed to get my SOAP client up and working.

  1. Download apache-cxf from the project website.
  2. Sign up for a free Webservices account with Amazon to get an access key.
  3. Generate a Maven2 Java application.
  4. Add to the default pom.xml the JAR files listed in the "all CXF usage" section of the lib/WHICH_JARS file of the apache-cxf distribution. The list omits the wsdl4j.jar file, which I added later.
  5. Locally install the JARs from the above list that are not already available in the repository, using mvn install:install-file.
  6. Run the wsdl2java command so that the Java files are generated in the right spot in the application.
  7. Develop the MyWebServicesClient.java file to define the service calls needed for the use case.
  8. Define a view bean (MyBook.java) to collect the information from the service into the application.
  9. Write a JUnit test to run the client.

I hope this post was useful. I did find some articles on the Internet about how to develop a SOAP web client, but most of these are from companies hawking their IDE, IDE plugins or other visual products, and you have to read between the lines to see how everything fits together. My goal of building the proof of concept described here was to understand how the whole thing works, something I do not get when someone points and clicks his way through an IDE or other visual tool.

Saturday, September 15, 2007

Jackrabbit Event Handling

In his article, Advanced Java Content Repository API, Sunil Patil says that two of the most popular advanced features of a JCR compliant content repository (one of which is Jackrabbit) are Versioning and Observation. Since I was already looking at Jackrabbit, I decided to check out these APIs a bit to see if I could find some use for them.

I can see the Versioning API being useful for organizations that actually generate their own content, and would need to track any changes made to documents. This is particularly true in industries with strong compliance requirements, such as Finance, Healthcare, etc. Although we do generate some internal content, it typically doesn't need to be maintained and revised; it just expires after a period of time, so we don't really have a need for version history. So I read about it in Sunil Patil's article, but didn't make any effort to actually try it in my own use case.

The Observation API looked interesting. It allows you to register Listeners on various predefined events such as a Node being removed or added, and Properties being added, removed or changed. I got interested in it because I thought that perhaps we could use these events to trigger legacy code that did not depend on the repository. As before, I decided to use the JCR module from the Spring Modules Project to make integration with Spring easier.

As an experiment, I decided to use the Observation API to trap a content update event, which would then trigger off a Lucene index update. The content update consists of dropping the content node for the content, creating a new one, and re-inserting the properties back in. The code for ContentUpdater.java is shown below:

import java.io.File;
import java.io.IOException;

import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import javax.jcr.query.QueryResult;

import org.apache.log4j.Logger;
import org.springframework.beans.factory.annotation.Required;
import org.springmodules.jcr.JcrCallback;
import org.springmodules.jcr.JcrTemplate;

public class ContentUpdater {

  private static final Logger LOGGER = Logger.getLogger(ContentUpdater.class);
  
  private String contentSource;
  private JcrTemplate jcrTemplate;
  private IParser parser;

  @Required
  public void setContentSource(String contentSource) {
    this.contentSource = contentSource;
  }

  @Required
  public void setJcrTemplate(JcrTemplate jcrTemplate) {
    this.jcrTemplate = jcrTemplate;
  }

  @Required
  public void setParser(IParser parser) {
    this.parser = parser;
  }

  public void update(final File file) {
    jcrTemplate.execute(new JcrCallback() {
      public Object doInJcr(Session session) throws IOException, RepositoryException {
        try {
          DataHolder dataHolder = parser.parse(file);
          String contentId = dataHolder.getProperty("contentId");
          Node contentSourceNode = getContentNode(session, contentSource, null);
          Node contentNode = getContentNode(session, contentSource, contentId);
          if (contentNode != null) {
            contentNode.remove();
          }
          contentNode = contentSourceNode.addNode("content");
          for (String propertyKey : dataHolder.getPropertyKeys()) {
            String value = dataHolder.getProperty(propertyKey);
            contentNode.setProperty(propertyKey, value);
          }
          session.save();
        } catch (Exception e) {
          throw new IOException("Parse error", e);
        }
        return null;
      }
    }); 
  }
  
  public Node getContentNode(final Session session, final String contentSource, 
      final String contentId) throws Exception {
    if (contentId == null) {
      return session.getRootNode().getNode(contentSource);
    }
    QueryManager queryManager = session.getWorkspace().getQueryManager();
    Query query = queryManager.createQuery("//" + contentSource + 
      "/content[@contentId='" + contentId + "']", Query.XPATH);
    QueryResult result = query.execute();
    NodeIterator ni = result.getNodes();
    if (ni.hasNext()) {
      Node contentNode = ni.nextNode();
      return contentNode;
    } else {
      return null;
    }
  }
}

When session.save() is called, Jackrabbit fires a bunch of events to be picked up by any interested EventListener objects. We define one such EventListener, which listens for one specific event generated by the ContentUpdater.java class and handles it. The code for the ContentUpdatedEventListener.java is shown below:

import java.io.IOException;
import java.util.List;

import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.Property;
import javax.jcr.PropertyIterator;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.observation.Event;
import javax.jcr.observation.EventIterator;
import javax.jcr.observation.EventListener;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import javax.jcr.query.QueryResult;

import org.apache.log4j.Logger;
import org.springframework.beans.factory.annotation.Required;
import org.springmodules.jcr.JcrCallback;
import org.springmodules.jcr.JcrTemplate;

/**
 * Event listener that gets called whenever a source File node changes.
 */
public class ContentUpdatedEventListener implements EventListener {

  private static final Logger LOGGER = Logger.getLogger(ContentUpdatedEventListener.class);
  
  private JcrTemplate jcrTemplate;
  private List<IEventHandler> eventHandlers;

  @Required
  public void setJcrTemplate(JcrTemplate jcrTemplate) {
    this.jcrTemplate = jcrTemplate;
  }

  @Required
  public void setEventHandlers(List<IEventHandler> eventHandlers) {
    this.eventHandlers = eventHandlers;
  }

  public void onEvent(final EventIterator eventIterator) {
    jcrTemplate.execute(new JcrCallback() {
      public Object doInJcr(Session session) throws IOException, RepositoryException {
        while (eventIterator.hasNext()) {
          Event event = eventIterator.nextEvent();
          if (event.getType() == Event.NODE_ADDED) {
            QueryManager queryManager = session.getWorkspace().getQueryManager();
            Query query = queryManager.createQuery("/" + event.getPath(), Query.XPATH);
            QueryResult result = query.execute();
            NodeIterator nodes = result.getNodes();
            if (nodes.hasNext()) {
              Node contentNode = nodes.nextNode();
              PropertyIterator properties = contentNode.getProperties();
              DataHolder dataHolder = new DataHolder();
              while (properties.hasNext()) {
                Property property = properties.nextProperty();
                dataHolder.setProperty(property.getName(), property.getValue().getString());
              }
              LOGGER.debug("Did I get here?");
              for (IEventHandler eventHandler : eventHandlers) {
                try {
                  eventHandler.handle(dataHolder);
                } catch (Exception e) {
                  LOGGER.info("Failed to handle event:" + event.getPath() +  
                      " of type:" + event.getType() + 
                      " by " + eventHandler.getClass().getName(), e);
                }
              }
            }
          }
        }
        return null;
      }
    });
  }
}

To make the design more modular and cleaner, the EventListener can be injected with a List of IEventHandler objects, whose handle() method gets called in a for loop, so multiple actions can happen when an event is trapped by the Listener. The IEventHandler.java code is shown below:

public interface IEventHandler {
  public void handle(DataHolder holder) throws Exception;
}

A dummy implementation that does nothing but print that it is updating a Lucene index is shown below, for illustration:

import org.apache.log4j.Logger;
import org.springframework.beans.factory.annotation.Required;

/**
 * A dummy class to demonstrate event handling.
 */
public class LuceneIndexUpdateEventHandler implements IEventHandler {

  private static final Logger LOGGER = Logger.getLogger(LuceneIndexUpdateEventHandler.class);
  private String indexPath;

  @Required
  public void setIndexPath(String indexPath) {
    this.indexPath = indexPath;
  }

  public void handle(DataHolder holder) throws Exception {
    LOGGER.info("Updated Lucene index at:" + indexPath);
  }
}
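The dispatch loop inside the listener can be looked at in isolation. The following is a pure-Python stand-in, not JCR code: the class and method names are made up, and handlers are plain callables rather than IEventHandler implementations. It only shows the pattern of fanning one event out to several handlers while keeping a failing handler from blocking the rest.

```python
class EventListener(object):
    """Stand-in for the listener: fans one event out to many handlers."""

    def __init__(self, handlers):
        self.handlers = list(handlers)
        self.failures = []

    def on_event(self, data):
        for handler in self.handlers:
            try:
                handler(data)
            except Exception as e:
                # Record the failure and carry on, so one broken
                # handler does not stop the others from running.
                self.failures.append((handler, e))
```

Adding a second action for the same event then means appending another handler to the list, which in the Spring setup corresponds to adding another bean reference to the eventHandlers property.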

Finally, we tie it all together with Spring configuration. Here is the applicationContext.xml file. Refer to my last post for the complete file; I show only the diffs here to highlight the changes and explain them:

<beans ...>
  ...
  <bean id="jcrSessionFactory" class="org.springmodules.jcr.JcrSessionFactory">
    ...
    <property name="eventListeners">
      <list>
        <ref bean="contentUpdatedEventListenerDefinition"/>
      </list>
    </property>
  </bean>

  <!-- The updater -->
  <bean id="myRandomContentUpdater" class="com.mycompany.myapp.ContentUpdater">
    <property name="contentSource" value="myRandomContentSource"/>
    <property name="jcrTemplate" ref="jcrTemplate"/>
    <property name="parser" ref="someRandomDocumentParser"/>
  </bean>

  <!-- Linked to the EventListener via this bean -->
  <bean id="contentUpdatedEventListenerDefinition" class="org.springmodules.jcr.EventListenerDefinition">
    <property name="absPath" value="/"/>
    <property name="eventTypes" value="1"/><!-- Event.NODE_ADDED -->
    <property name="listener" ref="contentUpdatedEventListener"/>
  </bean>
  
  <!-- The EventListener -->
  <bean id="contentUpdatedEventListener" class="com.mycompany.myapp.ContentUpdatedEventListener">
    <property name="jcrTemplate" ref="jcrTemplate"/>
    <property name="eventHandlers">
      <list>
        <ref bean="luceneIndexUpdateEventHandler"/>
      </list>
    </property>
  </bean>

  <!-- The EventHandler -->
  <bean id="luceneIndexUpdateEventHandler" class="com.mycompany.myapp.LuceneIndexUpdateEventHandler">
    <property name="indexPath" value="/tmp/lucene"/>
  </bean>
  
</beans>

The first change is to register one or more EventListenerDefinition beans with the JcrSessionFactory. This is shown in the first block above. The second block is simply the configuration for the ContentUpdater. The third block is the EventListenerDefinition, which says that the EventListener it defines listens to all events starting from the root and filters on event type 1 (Event.NODE_ADDED), and holds the actual reference to the EventListener bean. The fourth block is the definition and configuration for the ContentUpdatedEventListener EventListener implementation, which also takes in a List of IEventHandler objects. In our case the list contains only the reference to the dummy LuceneIndexUpdateEventHandler class. The final block is the bean definition for the IEventHandler.
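The eventTypes value is a bit mask, so one listener definition can cover several event types by combining their constants. The sketch below uses Python only for the arithmetic; the constant values are those defined on javax.jcr.observation.Event in JCR 1.0, and the helper function is made up for illustration.

```python
# Values of the javax.jcr.observation.Event constants in JCR 1.0.
NODE_ADDED = 1
NODE_REMOVED = 2
PROPERTY_ADDED = 4
PROPERTY_REMOVED = 8
PROPERTY_CHANGED = 16

EVENT_NAMES = [
    (NODE_ADDED, "NODE_ADDED"),
    (NODE_REMOVED, "NODE_REMOVED"),
    (PROPERTY_ADDED, "PROPERTY_ADDED"),
    (PROPERTY_REMOVED, "PROPERTY_REMOVED"),
    (PROPERTY_CHANGED, "PROPERTY_CHANGED"),
]

def describe_event_types(mask):
    """Return the names of the event types selected by a bit mask."""
    return [name for value, name in EVENT_NAMES if mask & value]
```

For example, NODE_ADDED | NODE_REMOVED evaluates to 3, so a value of 3 for eventTypes would select both node events, and describe_event_types(3) lists their names.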

To run this code, I have a very simple JUnit harness that calls the ContentUpdater.update() method with a File reference. The node corresponding to the File is updated and an event sent, and we get to see a log message like the following in our logs. Notice that this log is usually emitted after JUnit's messages, signifying that this is called asynchronously.

Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.246 sec

Results :

Tests run: 1, Failures: 0, Errors: 0, Skipped: 0

13 Sep 2007 09:33:29,509 INFO  com.healthline.jrtest.LuceneIndexUpdateEventHandler 
com.healthline.jrtest.LuceneIndexUpdateEventHandler.handle(LuceneIndexUpdateEventHandler.java:25)
(ObservationManager, ): Updated Lucene index at:/tmp/lucene
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 8 seconds
[INFO] Finished at: Thu Sep 13 09:33:29 PDT 2007
[INFO] Final Memory: 11M/86M
[INFO] ------------------------------------------------------------------------

The Observation API reminds me of a middleware application that I maintained for a while at a previous job, which was a bridge between our various home-grown content management systems and our actual publishing system. Events were sent as HTTP requests, and were converted into actual publishing requests by the application and sent to the publishing system. Jackrabbit's Observation API would be a perfect fit in this situation, and it would be so much more elegant.

As I was exploring the Versioning and Observation APIs, I had an epiphany. I realized the reason I have this whole love-hate thing (love the features, can't find enough reason to implement it) with Jackrabbit is that it's targeted at a business model different from mine. Jackrabbit (and I am guessing any CMS in general) is targeted at businesses which tend to manage their content in individual pieces, such as news stories in a news company or product spec sheets for manufacturing companies. Unlike them, we manage our content in bulk, regenerating all content from a content provider in batch mode. That may change in the future, and perhaps it would then be time to reconsider.

Saturday, September 08, 2007

Spring loaded Jackrabbit

So far I haven't been very enthusiastic about Jackrabbit, and yet I keep writing about it. My lack of enthusiasm stems from the fact that it would be quite an effort to move our existing content, which is stored as a combination of flat files, database tables and Lucene indexes, to any content repository, as well as to keep up with the steady flow of new content we are licensing. We also have tools and gadgets which require more granular access than that provided through Jackrabbit's standard query API.

However, of late, almost everything I do seems to be driven by whether I can apply it readily, which, in retrospect, seems to be a bit short-sighted. This was driven home to me recently when I was asked to implement an idea I had suggested (and developed a proof of concept for my own understanding) about a year ago. So what seems to be impractical today may not be so a year from now, so it may be worth spending time on some technology today in the hope that maybe the knowledge would be useful down the line. In fact, that's one reason I started with this blog in the first place. And there is no doubt that Jackrabbit is cool technology, and while there are still warts, I expect it to mature enough to justify production-quality use by the time I am ready to use it.

That said, one of the things that makes a piece of software attractive to me is its ability to be integrated with the Spring Framework, simply because I find Spring's IoC/dependency injection useful and hence tend to use it everywhere, from web applications to standalone Java projects. The Spring Modules project has built code to integrate with various other popular software, and one of them is JCR. Within the springmodules-jcr project, there is support for Jackrabbit and Jeceira, another open source CMS based on the JCR specifications.

Based upon an InfoQ article "Integrating Java Content Repository and Spring", written by Costin Leau, one of the developers on the Spring Modules project, I decided to rewrite my Content Loader and Retriever implementations that I described in my blog post two weeks ago, to use JcrTemplate and JcrCallback provided by springmodules-jcr, as well as let Spring build up my Repository and other objects using dependency injection.

First, the applicationContext.xml, so you know how it's all set up:

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:util="http://www.springframework.org/schema/util"
       xsi:schemaLocation="
       http://www.springframework.org/schema/beans 
       http://www.springframework.org/schema/beans/spring-beans-2.0.xsd
       http://www.springframework.org/schema/util 
       http://www.springframework.org/schema/util/spring-util-2.0.xsd">

  <bean id="repository" class="org.springmodules.jcr.jackrabbit.RepositoryFactoryBean">
    <property name="configuration" value="classpath:repository.xml"/>
    <property name="homeDir" value="file:/tmp/repository"/>
  </bean>
  
  <bean id="jcrSessionFactory" class="org.springmodules.jcr.JcrSessionFactory">
    <property name="repository" ref="repository"/>
    <property name="credentials">
      <bean class="javax.jcr.SimpleCredentials">
        <constructor-arg index="0" value="user"/>
        <constructor-arg index="1">
          <bean factory-bean="password" factory-method="toCharArray"/>
        </constructor-arg>
      </bean>
    </property>
  </bean>
  
  <bean id="password" class="java.lang.String">
    <constructor-arg index="0" value="password"/>
  </bean>
  
  <bean id="jcrTemplate" class="org.springmodules.jcr.JcrTemplate">
    <property name="sessionFactory" ref="jcrSessionFactory"/>
    <property name="allowCreate" value="true"/>
  </bean>

  <bean id="fileFinder" class="com.mycompany.myapp.FileFinder">
    <property name="filter" value=".xml"/>
  </bean>
  
  <bean id="someRandomDocumentParser" 
      class="com.mycompany.myapp.SomeRandomDocumentParser"/>
  
  <bean id="myRandomContentLoader" class="com.mycompany.myapp.ContentLoader2">
    <property name="fileFinder" ref="fileFinder"/>
    <property name="jcrTemplate" ref="jcrTemplate"/>
    <property name="contentSource" value="myRandomContentSource"/>
    <property name="parser" ref="someRandomDocumentParser"/>
    <property name="sourceDirectory" value="/path/to/my/random/content"/>
  </bean>
  
  <bean id="myRandomContentRetriever" class="com.mycompany.myapp.ContentRetriever2">
    <property name="jcrTemplate" ref="jcrTemplate"/>
  </bean>
    
</beans>    

The ContentLoader2.java is a version of ContentLoader.java which uses the springmodules-jcr API to work with Jackrabbit:

package com.mycompany.myapp;

import java.io.File;
import java.io.IOException;
import java.util.List;

import javax.jcr.Node;
import javax.jcr.PathNotFoundException;
import javax.jcr.RepositoryException;
import javax.jcr.Session;

import org.apache.log4j.Logger;
import org.springframework.beans.factory.annotation.Required;
import org.springmodules.jcr.JcrCallback;
import org.springmodules.jcr.JcrTemplate;

public class ContentLoader2 {

  private static final Logger LOGGER = Logger.getLogger(ContentLoader2.class);
  
  private FileFinder fileFinder;
  private String sourceDirectory;
  private String contentSource;
  private IParser parser;
  private JcrTemplate jcrTemplate;
  
  @Required
  public void setFileFinder(FileFinder fileFinder) {
    this.fileFinder = fileFinder;
  }

  @Required
  public void setJcrTemplate(JcrTemplate jcrTemplate) {
    this.jcrTemplate = jcrTemplate;
  }

  @Required
  public void setContentSource(String contentSource) {
    this.contentSource = contentSource;
  }

  @Required
  public void setParser(IParser parser) {
    this.parser = parser;
  }

  @Required
  public void setSourceDirectory(String sourceDirectory) {
    this.sourceDirectory = sourceDirectory;
  }

  public void load() throws Exception {
    jcrTemplate.execute(new JcrCallback() {
      public Object doInJcr(Session session) throws IOException, RepositoryException {
        try {
          Node contentSourceNode = getFreshContentSourceNode(session, contentSource);
          List<File> filesFound = fileFinder.find(sourceDirectory);
          for (File fileFound : filesFound) {
            DataHolder dataHolder = parser.parse(fileFound);
            if (dataHolder == null) {
              continue;
            }
            LOGGER.info("Parsed file: " + fileFound);
            Node contentNode = contentSourceNode.addNode("content");
            for (String propertyKey : dataHolder.getPropertyKeys()) {
              String value = dataHolder.getProperty(propertyKey);
              LOGGER.debug("Setting property " + propertyKey + "=" + value);
              contentNode.setProperty(propertyKey, value);
            }
            session.save();
          }
          // doInJcr() must return a value; this callback has nothing to hand back.
          return null;
        } catch (Exception e) {
          throw new IOException("Exception parsing and storing file", e);
        }
      }
    });
  }

  /**
   * Our policy is to do a fresh load each time, so we want to remove the contentSource
   * node from our repository first, then create a new one.
   * @param session the Repository Session.
   * @param contentSourceName the name of the content source.
   * @return a content source node. This is a top level element of the repository,
   * right under the repository root node.
   * @throws Exception if one is thrown.
   */
  private Node getFreshContentSourceNode(Session session, String contentSourceName) throws Exception {
    Node root = session.getRootNode();
    Node contentSourceNode = null;
    try {
      contentSourceNode = root.getNode(contentSourceName);
      if (contentSourceNode != null) {
        contentSourceNode.remove();
      }
    } catch (PathNotFoundException e) {
      LOGGER.info("Path for content source: " + contentSourceName + " not found, creating");
    }
    contentSourceNode = root.addNode(contentSourceName);
    return contentSourceNode;
  }
}

The ContentRetriever2.java, like the ContentLoader2.java, is a version of the original ContentRetriever.java file that works with Jackrabbit using the springmodules-jcr API:

package com.mycompany.myapp;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.Property;
import javax.jcr.PropertyIterator;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import javax.jcr.query.QueryResult;

import org.springmodules.jcr.JcrCallback;
import org.springmodules.jcr.JcrTemplate;

public class ContentRetriever2 {

  private JcrTemplate jcrTemplate;

  public void setJcrTemplate(JcrTemplate jcrTemplate) {
    this.jcrTemplate = jcrTemplate;
  }

  @SuppressWarnings("unchecked")
  public List<DataHolder> findAllByContentSource(final String contentSource) {
    return (List<DataHolder>) jcrTemplate.execute(new JcrCallback() {
      public Object doInJcr(Session session) throws IOException, RepositoryException {
        List<DataHolder> contents = new ArrayList<DataHolder>();
        Node contentSourceNode = session.getRootNode().getNode(contentSource);
        NodeIterator ni = contentSourceNode.getNodes();
        while (ni.hasNext()) {
          Node contentNode = ni.nextNode();
          String contentId = contentNode.getProperty("contentId").getValue().getString();
          contents.add(getContent(contentSource, contentId));
        }
        return contents;
      }
    });
  }
  
  public DataHolder getContent(final String contentSource, final String contentId) {
    return (DataHolder) jcrTemplate.execute(new JcrCallback() {
      public Object doInJcr(Session session) throws IOException, RepositoryException {
        DataHolder dataHolder = new DataHolder();
        QueryManager queryManager = session.getWorkspace().getQueryManager();
        Query query = queryManager.createQuery("//" + contentSource + 
          "/content[@contentId='" + contentId + "']", Query.XPATH);
        QueryResult result = query.execute();
        NodeIterator ni = result.getNodes();
        while (ni.hasNext()) {
          Node contentNode = ni.nextNode();
          PropertyIterator pi = contentNode.getProperties();
          dataHolder.setProperty("contentSource", contentSource);
          while (pi.hasNext()) {
            Property prop = pi.nextProperty();
            dataHolder.setProperty(prop.getName(), prop.getValue().getString());  
          }
          break;
        }
        return dataHolder;
      }
    });
  }
  
  public DataHolder getContentByUrl(final String contentSource, final String url) {
    return (DataHolder) jcrTemplate.execute(new JcrCallback() {
      public Object doInJcr(Session session) throws IOException, RepositoryException {
        DataHolder dataHolder = null;
        QueryManager queryManager = session.getWorkspace().getQueryManager();
        Query query = queryManager.createQuery("//" + contentSource + 
          "/content[@cfUrl='" + url + "']", Query.XPATH);
        QueryResult result = query.execute();
        NodeIterator ni = result.getNodes();
        while (ni.hasNext()) {
          Node contentNode = ni.nextNode();
          String contentId = contentNode.getProperty("contentId").getValue().getString();
          dataHolder = getContent(contentSource, contentId);
          break;
        }
        return dataHolder;
      }
    });
  }
}
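
The retriever above builds its XPath strings by concatenating contentId and url directly into a quoted literal, so a value containing an apostrophe would produce a malformed query. Here is a minimal sketch of a quoting helper. It assumes the query language follows the XPath 2.0 convention of doubling an apostrophe inside a single-quoted literal, which is worth verifying against your Jackrabbit version. The class and method names (XPathLiterals, quote, contentQuery) are hypothetical and not part of JCR or springmodules-jcr.

```java
/** Hypothetical helper for illustration; not part of JCR or springmodules-jcr. */
final class XPathLiterals {

  private XPathLiterals() {
  }

  /**
   * Wraps the value in single quotes, doubling any embedded apostrophes.
   * The doubling rule is an assumption based on XPath 2.0 string literals.
   */
  static String quote(String value) {
    return "'" + value.replace("'", "''") + "'";
  }

  /** Builds the same query shape that getContent() builds by hand. */
  static String contentQuery(String contentSource, String contentId) {
    return "//" + contentSource + "/content[@contentId=" + quote(contentId) + "]";
  }

  public static void main(String[] args) {
    // prints //myRandomContentSource/content[@contentId='o''reilly-1']
    System.out.println(contentQuery("myRandomContentSource", "o'reilly-1"));
  }
}
```

With a helper like this, the createQuery() calls in getContent() and getContentByUrl() could pass the quoted value instead of adding the quotes inline.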

If you compare the code above with the code in my older post, there is not much difference. The old code is now encapsulated inside anonymous JcrCallback implementations, which are passed to JcrTemplate.execute(). The other change is that I no longer build the JCR Repository and Session objects in my code. There are also no Repository.login() calls in my code, because Spring has already logged me in using the credentials configured on the session factory. However, one of the most important differences is the absence of checked exceptions thrown from the callback code: JcrTemplate converts the checked RepositoryException and IOException raised by the JCR calls into unchecked ones.
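To make the template/callback idea concrete, here is a minimal sketch of the pattern using only the JDK. It is not the springmodules-jcr implementation: MiniTemplate, StoreCallback, CheckedStoreException and UncheckedStoreException are hypothetical names, and a plain Map stands in for the JCR Session. It only shows how a template can own the session and translate checked exceptions so that the calling method needs no throws clause.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical checked exception, playing the role of RepositoryException. */
class CheckedStoreException extends Exception {
  CheckedStoreException(String message) {
    super(message);
  }
}

/** Hypothetical unchecked wrapper that callers are not forced to catch. */
class UncheckedStoreException extends RuntimeException {
  UncheckedStoreException(String message, Throwable cause) {
    super(message, cause);
  }
}

/** Callback in the style of JcrCallback: the body may throw checked exceptions. */
interface StoreCallback<T> {
  T doInStore(Map<String, String> session) throws IOException, CheckedStoreException;
}

/** Minimal template: supplies the "session" and translates checked exceptions. */
class MiniTemplate {
  // A Map stands in for the real session in this sketch.
  private final Map<String, String> session = new HashMap<String, String>();

  MiniTemplate() {
    session.put("title", "hello");
  }

  <T> T execute(StoreCallback<T> callback) {
    try {
      return callback.doInStore(session);
    } catch (IOException e) {
      throw new UncheckedStoreException("I/O failure in callback", e);
    } catch (CheckedStoreException e) {
      throw new UncheckedStoreException("Store failure in callback", e);
    }
  }
}

class MiniTemplateDemo {
  /** No throws clause here, even though the callback body throws a checked exception. */
  static String readProperty(MiniTemplate template, final String key) {
    return template.execute(new StoreCallback<String>() {
      @Override
      public String doInStore(Map<String, String> session) throws CheckedStoreException {
        String value = session.get(key);
        if (value == null) {
          throw new CheckedStoreException("No such property: " + key);
        }
        return value;
      }
    });
  }

  public static void main(String[] args) {
    MiniTemplate template = new MiniTemplate();
    System.out.println(readProperty(template, "title"));
    try {
      readProperty(template, "missing");
    } catch (UncheckedStoreException e) {
      // The original checked exception is still available as the cause.
      System.out.println("Unchecked: " + e.getCause().getMessage());
    }
  }
}
```

The design choice is the same one the retriever methods rely on: callers that cannot do anything useful with a repository failure are not forced to catch it, while callers that care can still catch the unchecked wrapper and inspect its cause.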

There is obviously a lot about Jackrabbit, JCR and springmodules-jcr that I don't know yet. From my limited knowledge, it looks like a product with lots of promise, even though I don't think it's useful to me right now. I plan to keep looking some more, and over the next few weeks, write about the features I think will be useful to me if I ever end up setting one up in a real environment.