Monday, May 30, 2011

Python JSON Prettifier

I have written earlier about my XML Viewer utility. Over the last few years, I have found good use for it, mainly for making sense of large compressed XML files. On a recent project that I got pulled into, we needed to make sense of (you guessed it) large compressed JSON files.

I was not working with these files directly myself, but after seeing one colleague peering through the files using a text editor, and another writing horrendously complex regular expressions to extract control data (ie how many records, etc) from these files, I realized that it may be useful to build something along the lines of my XML viewer, but for JSON. Because, you know, I may be the one doing these things next :-).

Looking around, I found the JSONFormat online service, which worked beautifully for the medium-sized JSON file I threw at it, but I figured it would be nice to have something I could run off the command line (without an internet connection when I am on my commute, for example). Besides, with a command-line tool I could use other Unix tools like grep and wc to get quick stats on the files without having to do any cut-n-paste madness.

My JSON library of choice for Java work is the Jackson JSON processor, but since this was going to be a script, I wanted to use Python to build it. I initially tried jsonlib, because it promised pretty-printing support. Unfortunately, the input file had characters encoded in ISO-8859-1 (copy-pasting data from MS-Office will do that to you), and several frustrating hours later, I ditched jsonlib in favor of the built-in json (formerly simplejson) library, for reasons completely unrelated to the capabilities of either one.

I finally figured out how to deal with the ISO-8859-1 encoding using the codecs library (also built in for Python 2.6), but the built-in pretty printing did not work quite the way I had hoped (ie, like jsonformat.com), so I figured that I could easily write my own - after all, a JSON object is just an arbitrary structure composed of nested lists and dictionaries.

So my approach was to use the codecs library to read the file into a Python object using the appropriate (user supplied) encoding. Strings in this Python object are stored as Unicode. My code then recurses through the object, putting in the necessary newlines and indents, and returning the strings as plain old ASCII strings, replacing non-ASCII characters with their XML entity references. Here is the code.

#! /usr/bin/python
# -*- coding: utf-8 -*-
#
# Converts the specified JSON file with the specified encoding
# (default encoding is utf8 if not specified) into a formatted 
# version with the specified indent (2 if not specified) and 
# writes to STDOUT.
#
# Usage: jsoncat.py inputfile.json [input_encoding] [indent]
#
import sys
import json
import codecs

def usage():
  print "Usage: %s input.json [input_encoding] [indent]" % sys.argv[0]
  print "input.json - the input file to read"
  print "input_encoding - optional, default utf-8"
  print "indent - optional, default 2"
  sys.exit(-1)

def prettyPrint(obj, indent, depth, suffix=False):
  if isinstance(obj, list):
    sys.stdout.write("%s[\n" % str(" " * indent * depth))
    sz = len(obj)
    i = 1
    for row in obj:
      prettyPrint(row, indent, depth + 1, i < sz)
      i = i + 1
    ct = "," if suffix else ""
    sys.stdout.write("%s]%s\n" % (str(" " * indent * depth), ct))
  elif isinstance(obj, dict):
    sys.stdout.write("%s{\n" % str(" " * indent * depth))
    sz = len(obj.keys())
    i = 1
    for key in obj.keys():
      sys.stdout.write("%s%s :" % (str(" " * indent * (depth + 1)), qq(key)))
      val = obj[key]
      if isinstance(val, list) or isinstance(val, dict):
        prettyPrint(val, indent, depth + 1, i < sz)
      else:
        prettyPrint(val, 1, 1, i < sz)
      i = i + 1
    ct = "," if suffix else ""
    sys.stdout.write("%s}%s\n" % (str(" " * indent * depth), ct))
  else:
    ct = "," if suffix else ""
    sys.stdout.write("%s%s%s\n" % (str(" " * indent * depth), qq(obj), ct))

def qq(obj):
  if isinstance(obj, unicode):
    return "\"" + obj.encode('ascii', 'xmlcharrefreplace') + "\""
  else:
    return repr(obj)

def main():
  if len(sys.argv) < 2 or len(sys.argv) > 4:
    usage()
  encoding = "utf-8" if len(sys.argv) == 2 else sys.argv[2]
  indent = 2 if len(sys.argv) < 4 else int(sys.argv[3])
  infile = codecs.open(sys.argv[1], "r", encoding)
  content = infile.read()
  infile.close()
  jsobj = json.loads(content)
  prettyPrint(jsobj, indent, 0)
  
if __name__ == "__main__":
  main()

To use it, you supply a JSON file and, optionally, an encoding and an indentation size. If no encoding is specified, the script assumes UTF-8; if no indent is specified, it assumes 2 characters. Output is written to the console, which you can capture into a file using standard Unix redirection, or pipe through various Unix filters if that makes sense.
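For example, assuming an ISO-8859-1 encoded file called data.json (the file name and the "id" key below are made up for illustration), the first command pretty-prints it into a new file, and the second counts the number of "id" keys using plain Unix tools:

python jsoncat.py data.json iso-8859-1 2 > data-pretty.json
python jsoncat.py data.json iso-8859-1 | grep -c '"id" :'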

Saturday, May 21, 2011

Customizing Lucene's Fast Vector Highlighter

Motivation & Background

I have never had to deal with highlighting search results before. Our approach so far has been to generate summaries statically at index time, using an algorithm that computes the "most representative" summary for the document. On the other hand, both highlighters provided by Lucene (Highlighter and Fast Vector Highlighter, aka FVH) attempt to extract the portion(s) of the text relevant to the query term, and highlight the resulting snippet.

The multiple relevant fragments are then joined together (with ellipsis) to form a snippet. Our requirements were slightly different - we needed to find the first occurrence of the term in the text, then start at the containing sentence and continue for a fixed number of characters (250 in our case). The resulting snippet would then be highlighted appropriately. In the case of the title, the entire string would be considered for highlighting (ie no snippet extraction required).

My first attempt at building this (using Javadocs and my IDE) ended in abject, utter failure. For some reason, the APIs for both modules are insanely complex (at least to me). In desperation (I was under a time crunch), I coded up an implementation based on Grant Ingersoll's post on SpanQuery and Highlighting - it mostly worked, but I needed to be very gentle analyzing the content (I used the WhitespaceAnalyzer and a home grown PunctuationAnalyzer which chopped up terms containing punctuation into two separate tokens), and had to write some pretty hairy (and brittle) custom code to convert the term positions in the matched Spans to character positions needed for snippet generation and highlighting. A colleague later wrote up an improved version using the old Highlighter, but that had some analysis issues as well, which needed custom hacks that made it quite slow to use.

Attempting to understand this code and possibly improve its performance, I checked out the code snippets and the associated explanations for both modules in the LIA2 Book, which states that the FVH is faster and more feature-rich than the Highlighter.

Of course, this functionality comes at the cost of increased disk space required to store the index (and the associated OS level cache). In addition to requiring the body to be stored (as the Highlighter does), the FVH also requires its term vectors, offsets and positions to be stored. Of course, an alternative is to pull the body from an external datastore and create a temporary in-memory index (RAMDirectory) for the slice of results being rendered, but I am unsure how it would scale. I guess I should run some tests to find out...
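To make the storage requirement concrete, this is roughly what a field declaration has to look like at index time for the FVH to be able to work against it (the field name and value here are just placeholders; the makeIndex() method further down does exactly this):

Document doc = new Document();
doc.add(new Field("content", contentText,
  Field.Store.YES, Field.Index.ANALYZED,
  Field.TermVector.WITH_POSITIONS_OFFSETS));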

Approach

Anyway, it turns out that building a custom highlighter using the FVH is not as hard as I had thought, although it does need a small change to the Lucene 3.1 code.

The FVH is built using a FragListBuilder and a FragmentsBuilder. The FragmentsBuilder is responsible for generating the fragments that will ultimately form the snippet, and highlighting them with the query term(s). The term positions which it needs for highlighting are buried deep within it, namely the Toffs in the list of SubInfo objects in the list of WeightedFragInfo objects. The FragListBuilder seems to be the place where the actual match offsets are computed (although I am not completely sure about this). I found this out by stepping through the LIA2 example code in a debugger.

Based on these findings, I figured I could do what I needed to do by creating my own implementation of FragmentsBuilder, so that's what I did. The rest of the post describes the code and the other things I had to do to get this going.

Hacking Lucene Code

Calling this a hack is probably a bit grandiose; it's basically a one-word change in the FieldFragList class. Before my change, the fragInfos variable had default visibility, so FragmentsBuilder subclasses provided by Lucene could directly access this object, but not custom implementations such as mine. So all I did was download the source from the Lucene 3.1 SVN repository and change the line in FieldFragList.java from:

  List<WeightedFragInfo> fragInfos = new ArrayList<WeightedFragInfo>();
to:
  public List<WeightedFragInfo> fragInfos = new ArrayList<WeightedFragInfo>();

then recompiled the source and replaced the lucene-highlighter-3.1.0.jar with the one I had just generated.

This allows my FragmentsBuilder subclass (of SimpleFragmentsBuilder, in my case) to get at the list of WeightedFragInfo objects in the FieldFragList passed to it, which it needs for finding the match offsets.

Update (2011-05-30) - I opened LUCENE-3141 requesting that this field be made accessible in future releases. It was accepted and checked into trunk and the 3.x branch, so future Lucene versions will expose a FieldFragList.getFragInfos() method and the hack above won't be required.
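Once you are on a Lucene build that contains the fix, the custom FragmentsBuilder can get at the same information through the getter instead of reaching into the field directly, along these lines (a sketch, assuming a post-3.1 build):

List<WeightedFragInfo> fragInfos = fieldFragList.getFragInfos();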

Custom Highlighter

I started with the code snippet in the LIA2 book (Section 8.4, pages 275-277), building it as a JUnit unit test, then gradually modified it to do what I needed. Here is the final code.

package com.mycompany.hilite;

import java.io.File;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.commons.collections.CollectionUtils;
import org.apache.commons.collections.Predicate;
import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.lang.math.IntRange;
import org.apache.commons.lang.math.Range;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.Field.TermVector;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.vectorhighlight.FastVectorHighlighter;
import org.apache.lucene.search.vectorhighlight.FieldFragList;
import org.apache.lucene.search.vectorhighlight.FieldQuery;
import org.apache.lucene.search.vectorhighlight.FragListBuilder;
import org.apache.lucene.search.vectorhighlight.FragmentsBuilder;
import org.apache.lucene.search.vectorhighlight.SimpleFragListBuilder;
import org.apache.lucene.search.vectorhighlight.SimpleFragmentsBuilder;
import org.apache.lucene.search.vectorhighlight.FieldFragList.WeightedFragInfo;
import org.apache.lucene.search.vectorhighlight.FieldFragList.WeightedFragInfo.SubInfo;
import org.apache.lucene.search.vectorhighlight.FieldPhraseList.WeightedPhraseInfo.Toffs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

public class CustomFastVectorHighlighterTest {

  @Test
  public void testBookExample() throws Exception {
    makeIndex();
    searchIndex();
  }

  // data
  private static final String[] TITLES = new String[] { ... };
  private static final String[] CONTENTS = new String[] { ... };
  
  // configuration
  private static final String PRE_TAG = 
    "<span style=\"background-color:#ffff00\">";
  private static final String POST_TAG = "</span>";
  private static final int CONTENT_FRAG_LEN = 250;
  private static final int TITLE_FRAG_LEN = -1;
  private static final String[] FIELDS = new String[] {"title", "content"};
  private static final String QUERY_STRING = "some phrase";

  private static Directory DIR = new RAMDirectory();
  
  private void makeIndex() throws IOException {
    IndexWriterConfig iwconf = new IndexWriterConfig(
      Version.LUCENE_30, getAnalyzer());
    IndexWriter writer = new IndexWriter(DIR, iwconf);
    int ndocs = TITLES.length;
    for (int i = 0; i < ndocs; i++) {
      Document doc = new Document();
      doc.add(new Field("title", TITLES[i], 
        Store.YES, Index.ANALYZED, TermVector.WITH_POSITIONS_OFFSETS));
      doc.add(new Field("content", CONTENTS[i], 
        Store.YES, Index.ANALYZED, TermVector.WITH_POSITIONS_OFFSETS));
      writer.addDocument(doc);
    }
    writer.close();
  }

  private void searchIndex() throws Exception {
    IndexSearcher searcher = new IndexSearcher(DIR);
    for (String field : FIELDS) {
      QueryParser parser = new QueryParser(
        Version.LUCENE_30, field, getAnalyzer());
      Query q = parser.parse(field + ":\"" + QUERY_STRING + "\"");
      int fragLen = "content".equals(field) ? 
        CONTENT_FRAG_LEN : TITLE_FRAG_LEN; 
      FastVectorHighlighter highlighter = getHighlighter(fragLen);
      FieldQuery fq = highlighter.getFieldQuery(q);
      ScoreDoc[] hits = searcher.search(q, TITLES.length).scoreDocs;
      for (ScoreDoc hit : hits) {
        String snippet = highlighter.getBestFragment(
          fq, searcher.getIndexReader(), hit.doc, field, CONTENT_FRAG_LEN);
        System.out.println(field + "=[" + snippet +
          ("content".equals(field) ? "..." : "") + "]");
      }
    }
    searcher.close();
  }

  @SuppressWarnings("unchecked")
  private Analyzer getAnalyzer() throws IOException {
    final Set<String> stopset = new HashSet<String>();
    List<String> lines = FileUtils.readLines(
      new File("/path/to/stopwords.txt"));
    for (String line : lines) {
      if (StringUtils.isEmpty(line) || line.startsWith("#")) {
        continue;
      }
      stopset.add(StringUtils.trim(line));
    }
    return new Analyzer() {
      @Override
      public TokenStream tokenStream(String fieldName, Reader reader) {
        Tokenizer tokenizer = new StandardTokenizer(Version.LUCENE_30, reader);
        TokenFilter filter = new LowerCaseFilter(Version.LUCENE_31, tokenizer);
        filter = new StopFilter(Version.LUCENE_30, filter, stopset); 
        filter = new PorterStemFilter(filter);
        return filter;
      }
    };
  }

  private FastVectorHighlighter getHighlighter(int fragLen) {
    FragListBuilder fragListBuilder = new SimpleFragListBuilder();
    FragmentsBuilder fragBuilder = new MyFragmentsBuilder(
      PRE_TAG, POST_TAG, fragLen);
    return new FastVectorHighlighter(true, true, 
      fragListBuilder, fragBuilder);
  }

  private class MyFragmentsBuilder extends SimpleFragmentsBuilder {

    private int fragLen;
    private String pretag;
    private String posttag;
    
    public MyFragmentsBuilder(String pretag, String posttag, int fragLen) {
      super(new String[] {pretag}, new String[] {posttag});
      this.pretag = pretag;
      this.posttag = posttag;
      this.fragLen = fragLen;
    }
    
    @Override
    public String createFragment(IndexReader reader, int docId,
        String fieldName, FieldFragList fieldFragList ) 
        throws IOException {
      // read the source string back from the index
      Document doc = reader.document(docId);
      String source = doc.get(fieldName);
      if (StringUtils.isEmpty(source)) {
        return source;
      }
      // find the occurrences of the matched phrase
      List<Range> termPositions = new ArrayList<Range>();
      List<WeightedFragInfo> fragInfos = fieldFragList.fragInfos;
      for (WeightedFragInfo fragInfo : fragInfos) {
        List<SubInfo> subInfos = fragInfo.getSubInfos();
        for (SubInfo subInfo : subInfos) {
          List<Toffs> termOffsets = subInfo.getTermsOffsets();
          for (Toffs termOffset : termOffsets) {
            Range termPosition = new IntRange(
              termOffset.getStartOffset(), termOffset.getEndOffset());
            termPositions.add(termPosition);
          }
        }
      }
      if (termPositions.size() == 0) {
        return StringUtils.substring(source, 0, fragLen);
      }
      int startFragPosition = 0;
      int endFragPosition = 0;
      // read back on the char array until we find a period,
      // then read front until we find a letter/digit. This
      // is our fragment start position. If no period found,
      // then this must be the first sentence, start here.
      if (fragLen < 0) {
        // we don't need a fragLen for titles, take them whole
        // so in this case fragLen should be -1.
        endFragPosition = source.length();
      } else {
        int startTermPosition = termPositions.get(0).getMinimumInteger();
        char[] sourceChars = source.toCharArray();
        for (int i = startTermPosition; i >= 0; i--) {
          if (sourceChars[i] == '.') {
            startFragPosition = i;
            break;
          }
        }
        for (int i = startFragPosition; i < sourceChars.length; i++) {
          if (Character.isLetterOrDigit(sourceChars[i])) {
            startFragPosition = i;
            break;
          }
        }
        endFragPosition = 
          Math.min(startFragPosition + fragLen, sourceChars.length);
      }
      // return the substring bounded by start and end, highlighting
      // the matched phrase
      final Range fragRange = 
        new IntRange(startFragPosition, endFragPosition);
      CollectionUtils.filter(termPositions, new Predicate() {
        public boolean evaluate(Object obj) {
          Range r = (Range) obj;
          return (fragRange.containsRange(r));
        }
      });
      if (termPositions.size() == 0) {
        // unlikely, since we are pretty sure that there is at least
        // one term position in our fragRange, but just being paranoid
        return StringUtils.substring(source, startFragPosition, endFragPosition);
      }
      StringBuilder buf = new StringBuilder();
      buf.append(StringUtils.substring(
        source, startFragPosition, 
        termPositions.get(0).getMinimumInteger()));
      int numHighlights = termPositions.size();
      for (int i = 0; i < numHighlights; i++) {
        buf.append(pretag).
          append(StringUtils.substring(source, 
            termPositions.get(i).getMinimumInteger(), 
            termPositions.get(i).getMaximumInteger())).
          append(posttag);
        if (i < numHighlights - 1) {
          buf.append(StringUtils.substring(source, 
            termPositions.get(i).getMaximumInteger(), 
            termPositions.get(i+1).getMinimumInteger()));
        }
      }
      buf.append(StringUtils.substring(source, 
        termPositions.get(numHighlights-1).getMaximumInteger(), 
        fragRange.getMaximumInteger())); 
      return buf.toString();
    }    
  }
}

The TITLES and CONTENTS data were created by running the QUERY_STRING against an existing index and taking the top 10 documents' values for these fields. In a real (ie non unit test) situation, these would be the actual search results for the page that needs to be highlighted.

The makeIndex() method creates an in-memory Lucene index and populates it with documents whose titles and contents come from the TITLES and CONTENTS arrays. As I stated before, I like this approach of getting these values from an external source; that way I can keep the index size down.

In the searchIndex() method, we parse and run two queries against the in-memory index, once for the title and once for the content. The highlighter is then called for each document returned in the hits.

The main difference between the LIA2 book example and this one is, of course, the custom FragmentsBuilder that creates the single highlighted fragment containing the first occurrence of the matched phrase in its createFragment() method. It first extracts the match position occurrences into a List of Range objects. It then looks backward from the beginning of the first range till it finds a sentence terminator (ie a period), then looks forward for a letter or digit. This is the start position of our snippet. It then just counts a fixed number of characters forward (250 in our case) and that is our end position. We then decorate the snippet with the highlights using the position occurrences that occur in our snippet range.

Here is some sample output using the results of the highlighting in my unit test.


Heart attack
A heart attack occurs when the blood flow that carries oxygen to the heart is blocked. The heart muscle becomes starved for oxygen and begins to die. See heart attack for more specific causes. ...

Jaw pain and heart attacks
Tooth pain and heart attacks; Heart attacks and jaw pain Question: Can pain in the jaw or teeth be an indication of a heart attack ? Answer: Sometimes. Heart pain can radiate to the jaw and teeth. It is more common for heart-related di...

Heart attack : Symptoms
Chest pain is a major symptom of heart attack. You may feel the pain in only one part of your body, or it may move from your chest to your arms, shoulder, neck, teeth, jaw, belly area, or back. The pain can be severe or mild. It can feel like: ...


Advantages

I find that there are a number of advantages to using FVH (compared to say, the Highlighter or the SpanQuery approaches I spoke of earlier).

  • Analysis happens once, and can therefore be as intense as desired - in my case, I started out with just Standard, but now I have the Standard, LowerCase, Stop and PorterStem filters, with no downstream hacks to accommodate the increased stemming.
  • FVH supports phrase queries - the Highlighter works at a term level, so sometimes you can get non-intuitive results. The SpanQuery approach does work on the phrase level, but is not as robust as the FVH.
  • FVH allows multi-colored tags - I don't really need it at the moment, and if I do need it, it would be for two classes of queries, ie my QUERY_STRING would be something like "(Q11 OR Q12 OR ...) OR (Q21 OR Q22 OR ...)" and the first group would need one color and the second group would need another. Not sure if (and how) that can be done with the FVH. I guess I'll find out when I need it.

Saturday, May 14, 2011

Custom Sorting in Solr using External Database Data

Recently, someone (from another project) asked me if I knew how to sort data in Solr by user-entered popularity ratings. The ratings would be entered as thumbs up/down clicks on the search results page and captured into an RDBMS. Users would be given the option to sort the data by popularity (as reflected by the percentage of thumbs-up ratings vs thumbs-down ratings).

I knew how to do this with Lucene (using a ScoreDocComparator against an in-memory cache of the database table set to a relatively short expiration time), but I figured that since Solr is used so heavily in situations where user feedback is king, there must be something built into Solr, so I promised to get back to him after I did some research (aka googling, asking on the Solr list, etc).

A quick scan on Google pointed me to this Nabble thread, circa 2007, which seemed reasonable but still required some customization of Solr code. So I asked on the Solr list, and as expected, Solr did offer a way to do this (without customizing Solr code), using ExternalFileField field types. The post on tailsweep offers more details about this approach.

We saw some problems with this approach, however. First, the ExternalFileField is not used for sorting; rather, it is used to influence the scores using a function query, so the results did not seem quite deterministic. Second, the process of updating ratings involves writing out the database table into a flat file and copying it into Solr's index directory, followed by a commit. This can be scripted, of course, but it involves making yet another copy of the database. Third, since the approach does not provide a way to pull the popularity rank information into the results, the client application would have to make a separate lookup against the database (bringing up the issue of stale data, since the database could be ahead of the file version at a point in time).

For these reasons, I decided to explore the programmatic approach outlined in the Nabble page above. The approach involves the following steps.

  • Create a custom FieldType. The getSortField() method in this class should return a custom SortField using a custom FieldComparatorSource object (more on that a couple of steps down).
  • Configure this field type and a field of that type in your Solr schema.xml.
  • Build a custom FieldComparatorSource that returns a custom FieldComparator in its newComparator() method. The FieldComparator should use data from the database to do its ordering.
  • Once done, results can be sorted by &sort=rank+desc (or asc) in the request URL.

I had initially thought of using a short-lived cache to proxy the database call through, thus ensuring that any entry is at most, say 5 minutes old. But I was worried about the performance impact of this, so I decided to go with the approach taken by ExternalFileField and use an in-memory map which is refreshed periodically via a commit. Further, since I was writing code in Solr anyway, I decided to provide the thumbs-up/down percentages in the Solr response. Here is what I ended up doing.

Database

Assume that the data is contained in a table with the following structure. The uid refers to a unique content id assigned by the application that created the content (in our case it is an MD5 hash of the URL).

+-------+----------+------+-----+---------+-------+------------------------+
| Field | Type     | Null | Key | Default | Extra | Comment                |
+-------+----------+------+-----+---------+-------+------------------------+
| uid   | char(32) | NO   | PRI | NULL    |       | unique content id      |
| n_up  | int(11)  | NO   |     | 0       |       | number of thumbs up    |
| n_dn  | int(11)  | NO   |     | 0       |       | number of thumbs down  |
| n_tot | int(11)  | NO   |     | 0       |       | total votes            |
+-------+----------+------+-----+---------+-------+------------------------+

The rank is calculated as shown below. The raw n_up and n_dn data are converted to rounded integer percentages. Then the rank is computed as the difference between the percentages, returning a number between -100 and +100. In order to keep the rank positive for scoring, we rebase the rank by 100.

  thumbs_up = round(n_up * 100 / n_tot)
  thumbs_dn = round(n_dn * 100 / n_tot)
  rank = thumbs_up - thumbs_dn + 100
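For example, a row with n_up = 30, n_dn = 10 and n_tot = 40 works out to thumbs_up = 75 and thumbs_dn = 25, giving a rank of 75 - 25 + 100 = 150. A document with no row in the table simply defaults to the base rank of 100 (see the getRank() method in the listener further down).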

Custom Field Type

The first step is to create the custom Rank FieldType, which is fairly trivial. I was writing this stuff against a slightly outdated SVN version of the Solr and Lucene code (I already had an index lying around, so I decided to reuse that instead of building one for the Solr 1.4.1/Lucene 2.9.2 versions that I use at work). There is a slight difference in the API for the Solr 1.4.1 version - you need to implement another write method - but you can just adapt it from an existing FieldType, like I did with this one.

Here is the code for the custom FieldType. Of particular interest is the getSortField() method, which returns a RankFieldComparatorSource instance.

// Source: src/java/org/apache/solr/schema/ext/RankFieldType.java
package org.apache.solr.schema.ext;

import java.io.IOException;

import org.apache.lucene.document.Fieldable;
import org.apache.lucene.search.SortField;
import org.apache.solr.response.TextResponseWriter;
import org.apache.solr.schema.FieldType;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.search.ext.RankFieldComparatorSource;

public class RankFieldType extends FieldType {

  @Override
  public SortField getSortField(SchemaField field, boolean top) {
    return new SortField(field.getName(), 
      new RankFieldComparatorSource(), top);
  }

  @Override
  // copied verbatim from GeoHashField method
  public void write(TextResponseWriter writer, String name, Fieldable f)
      throws IOException {
    writer.writeStr(name, f.stringValue(), false);
  }
}

The next step is to configure this field type and a field of this type in the schema.xml file. The relevant snippets of my updated schema.xml file are shown below:

<!-- Source: solr/example/.../conf/schema.xml -->
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="adam" version="1.3">
  <types>
    ...
    <fieldType name="rank_t" class="org.apache.solr.schema.ext.RankFieldType"/>
  </types>
 <fields>
   ...
   <field name="rank" type="rank_t" indexed="true" stored="true"/>
 </fields>
 ...
</schema>

Custom Field Comparator

The third step is to build our custom FieldComparatorSource and FieldComparator classes. Since my custom FieldComparator is unlikely to be called from any place other than my custom FieldComparatorSource, I decided to package them into a single file, with the FieldComparator defined as an inner class.

The custom FieldComparator is adapted almost verbatim from DocFieldComparator. The only difference is that instead of comparing raw docIDs, its methods compare the ranks at those docIDs. Here is the code:

// $Source$ src/java/org/apache/solr/search/ext/RankFieldComparatorSource.java
package org.apache.solr.search.ext;

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldComparator;
import org.apache.lucene.search.FieldComparatorSource;
import org.apache.solr.util.ext.RankUpdateListener;

/**
 * Called from RankFieldType.getSortField() method.
 */
public class RankFieldComparatorSource extends FieldComparatorSource {

  @Override
  public FieldComparator newComparator(String fieldname, int numHits,
      int sortPos, boolean reversed) throws IOException {
    return new RankFieldComparator(numHits);
  }

  // adapted from DocFieldComparator
  public static final class RankFieldComparator extends FieldComparator {
    private final int[] docIDs;
    private int docBase;
    private int bottom;

    RankFieldComparator(int numHits) {
      docIDs = new int[numHits];
    }

    @Override
    public int compare(int slot1, int slot2) {
      return getRank(docIDs[slot1]) - getRank(docIDs[slot2]);
    }

    @Override
    public int compareBottom(int doc) {
      return getRank(bottom) - getRank(docBase + doc);
    }

    @Override
    public void copy(int slot, int doc) {
      docIDs[slot] = docBase + doc;
    }

    @Override
    public FieldComparator setNextReader(IndexReader reader, int docBase) {
      this.docBase = docBase;
      return this;
    }
    
    @Override
    public void setBottom(final int bottom) {
      this.bottom = docIDs[bottom];
    }

    @Override
    public Comparable<?> value(int slot) {
      return getRank(docIDs[slot]);
    }
    
    private Integer getRank(int docId) {
      return RankUpdateListener.getRank(docId);
    }
  }
}

Database Integration

The ranks themselves come from the RankUpdateListener, which is a SolrEventListener and listens for the firstSearcher and newSearcher events. When these events occur, it reads all the records from the database table and recreates two in-memory maps keyed by docID, one returning the computed rank value and the other returning a 2-element list containing the computed thumbs up and down percentages.

Since I needed to connect to the database, I had to pass the database connection properties to this listener. I added this new file to my conf directory.

# Source: solr/example/.../conf/database.properties
database.driverClassName=com.mysql.jdbc.Driver
database.url=jdbc:mysql://localhost:3306/mytestdb
database.username=root
database.password=secret

To get at this file from within my listener, I needed the SolrResourceLoader, which looks at the conf directory, and which I could get only from the SolrCore object. I tried implementing SolrCoreAware and getting the SolrCore from the inform(SolrCore) method, but Solr does not allow a listener to implement SolrCoreAware, so I followed the pattern in a few other listeners and just passed the SolrCore in through the constructor. This works, ie no extra configuration is needed for the listener in solrconfig.xml to indicate that it takes a SolrCore in its constructor.

I also added the MySQL JAR file (I am using MySQL for my little experiment, in case you haven't guessed from the formatting already) to the solr/lib directory. Doing this ensures that this JAR file will be packaged into the solr.war when I do ant dist-war.

The idea of using a Listener instead of a plain old service was to ensure that the in-memory hashmaps are repopulated every time a commit is sent. This allows for situations where there have been deletes and the docIDs no longer point to the same documents they did before the commit. It also offers a (somewhat crude, IMO) mechanism to trigger ratings refreshes. I think I would prefer to have it repopulate on a timer, say every 5 minutes, and after every commit (which may be less frequent or never). The code change to do this is quite simple - since populateRanks() is factored out into a separate method, it can simply be called from two places instead of one (a sketch of this is shown after the listener code below). Here is the code for the listener.

// Source: src/java/org/apache/solr/util/ext/RankUpdateListener.java
package org.apache.solr.util.ext;

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.SolrCore;
import org.apache.solr.core.SolrEventListener;
import org.apache.solr.core.SolrResourceLoader;
import org.apache.solr.search.SolrIndexSearcher;

public class RankUpdateListener implements SolrEventListener {

  private static final Integer BASE = 100;
  
  private static final Map<Integer,Integer> ranks = 
    new HashMap<Integer,Integer>();
  private static final Map<Integer,List<Integer>> thumbs = 
    new HashMap<Integer,List<Integer>>();
  
  private SolrCore core;
  private Map<String,String> dbprops;

  public RankUpdateListener(SolrCore core) {
    this.core = core;
    this.dbprops = new HashMap<String,String>();
  }
  
  ////////////// SolrEventListener methods /////////////////
  
  @Override
  public void init(NamedList args) {
    try {
      SolrResourceLoader loader = core.getResourceLoader();
      List<String> lines = loader.getLines("database.properties");
      for (String line : lines) {
        if (StringUtils.isEmpty(line) || line.startsWith("#")) {
          continue;
        }
        String[] kv = StringUtils.split(line, "=");
        dbprops.put(kv[0], kv[1]);
      }
      Class.forName(dbprops.get("database.driverClassName"));
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public void newSearcher(SolrIndexSearcher newSearcher,
      SolrIndexSearcher currentSearcher) {
    try {
      populateRanks(newSearcher);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
  
  @Override
  public void postCommit() { /* NOOP */ }
  
  ////////////// Service methods /////////////////////
  
  public static List<Integer> getThumbs(int docId) {
    if (thumbs.containsKey(docId)) {
      return thumbs.get(docId);
    } else {
      return Arrays.asList(0, 0);
    }
  }
  
  public static Integer getRank(int docId) {
    if (ranks.containsKey(docId)) {
      return ranks.get(docId);
    } else {
      return BASE;
    }
  }
  
  // reads the rank data from the database table, computes the
  // thumbs up/down percentages and ranks, and rebuilds the
  // in-memory maps keyed by Lucene docId
  private synchronized void populateRanks(SolrIndexSearcher searcher) 
      throws IOException {
    Map<Integer,Integer> ranksCopy = new HashMap<Integer,Integer>();
    Map<Integer,List<Integer>> thumbsCopy = 
      new HashMap<Integer,List<Integer>>();
    Connection conn = null;
    PreparedStatement ps = null;
    ResultSet rs = null;
    try {
      conn = DriverManager.getConnection(
        dbprops.get("database.url"), 
        dbprops.get("database.username"), 
        dbprops.get("database.password"));
      ps = conn.prepareStatement(
        "select uid, n_up, n_dn, n_tot from thumbranks");
      rs = ps.executeQuery();
      while (rs.next()) {
        String uid = rs.getString(1);
        int nup = rs.getInt(2);
        int ndn = rs.getInt(3);
        int ntot = rs.getInt(4);
        // calculate thumbs up/down percentages
        int tupPct = Math.round((float) (nup * 100) / (float) ntot);
        int tdnPct = Math.round((float) (ndn * 100) / (float) ntot);
        // calculate score, rebasing by 100 to make positive
        int rank = tupPct - tdnPct + BASE;
        // find Lucene docId for uid
        ScoreDoc[] hits = searcher.search(
          new TermQuery(new Term("uid", uid)), 1).scoreDocs;
        for (int i = 0; i < hits.length; i++) {
          int docId = hits[i].doc;
          ranksCopy.put(docId, rank);
          thumbsCopy.put(docId, Arrays.asList(tupPct, tdnPct));
        }
      }
    } catch (SQLException e) {
      throw new IOException(e);
    } finally {
      if (conn != null) { 
        try { conn.close(); } catch (SQLException e) { /* NOOP */ }
      }
    }
    ranks.clear();
    ranks.putAll(ranksCopy);
    thumbs.clear();
    thumbs.putAll(thumbsCopy);
  }
}
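As an aside, here is a minimal sketch of what the timer-based refresh mentioned above could look like. The startPeriodicRefresh() method and the 5-minute interval are hypothetical additions to the listener, and the code assumes imports for java.util.concurrent and org.apache.solr.util.RefCounted:

// Hypothetical addition to RankUpdateListener: refresh the rank maps on
// a fixed schedule, in addition to the commit-driven refresh above.
private final ScheduledExecutorService scheduler =
  Executors.newSingleThreadScheduledExecutor();

public void startPeriodicRefresh() {
  scheduler.scheduleAtFixedRate(new Runnable() {
    @Override
    public void run() {
      // grab the current searcher from the core, refresh the maps,
      // and always release the reference when done
      RefCounted<SolrIndexSearcher> holder = core.getSearcher();
      try {
        populateRanks(holder.get());
      } catch (IOException e) {
        // log and retry on the next run
      } finally {
        holder.decref();
      }
    }
  }, 5, 5, TimeUnit.MINUTES);
}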

The listener needs to be configured in the solrconfig.xml file; here is the relevant snippet from mine.

<!-- Source: solr/examples/.../conf/solrconfig.xml -->
<?xml version="1.0" encoding="UTF-8" ?>
<config>
  ...
    <listener event="firstSearcher" 
        class="org.apache.solr.util.ext.RankUpdateListener"/>
    <listener event="newSearcher" 
        class="org.apache.solr.util.ext.RankUpdateListener"/>
  ...
</config>

Returning external data in response

At this point, if I rebuild my solr.war with these changes, I can sort the results by adding a &sort=rank+desc (or rank+asc) to my Solr request URL.

However, the caller would still have to consult the database to get the thumbs up and down percentages, since the response does not contain this information. Nor can we ask for this information by specifying &fl=*,rank.

This is (fairly) easily remedied, however. Simply add a SearchComponent that checks to see if the fl parameter contains "rank", and if so, converts the DocSlice in the response into a SolrDocumentList and plugs in the cached values using the service methods on the RankUpdateListener component above.
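Once such a component is wired in (the configuration is shown below), a request along these lines (host, port and query terms are placeholders) would return the extra rank fields and sort by rank:

http://localhost:8983/solr/select?q=some+query&fl=*,rank&sort=rank+desc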

So, configuration-wise, we need to add a last-components entry to the handler that we are using. We are using the "standard" SearchHandler, so we just add our custom component in our solrconfig.xml file, as shown in the snippet below:

<!-- Source: solr/example/.../conf/solrconfig.xml -->
<config>
  ...
  <searchComponent name="rank-extract" 
      class="org.apache.solr.handler.component.ext.RankExtractComponent"/>
  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
    </lst>
    <arr name="last-components">
      <str>rank-extract</str>
    </arr>
  </requestHandler>
  ...
</config>

Here is the code for the RankExtractComponent. It only does anything if "rank" is found in the request's fl (field list) parameter. If it is found, it adds the rank and the thumbs_up and thumbs_down percentages to each document in the response.

// Source: src/java/org/apache/solr/handler/component/ext/RankExtractComponent.java
package org.apache.solr.handler.component.ext;

import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.commons.lang.StringUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Fieldable;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocSlice;
import org.apache.solr.search.SolrIndexReader;
import org.apache.solr.util.ext.RankUpdateListener;

/**
 * Custom component that extracts rank information out of
 * the database and appends it to the response. This is 
 * configured as a last-components in our search handler
 * chain.
 */
public class RankExtractComponent extends SearchComponent {

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    /* NOOP */
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    Set<String> returnFields = getReturnFields(rb);
    if (returnFields.contains("rank")) {
      // only trigger this code if user explicitly lists rank
      // in the field list. This changes the DocSlice in the 
      // result returned by the standard component and replaces
      // it with a SolrDocumentList (whose attributes are more 
      // amenable to modification).
      DocSlice slice = (DocSlice) rb.rsp.getValues().get("response");
      SolrIndexReader reader = rb.req.getSearcher().getReader();
      SolrDocumentList rl = new SolrDocumentList();
      for (DocIterator it = slice.iterator(); it.hasNext(); ) {
        int docId = it.nextDoc();
        Document doc = reader.document(docId);
        SolrDocument sdoc = new SolrDocument();
        List<Fieldable> fields = doc.getFields();
        for (Fieldable field : fields) {
          String fn = field.name();
          if (returnFields.contains(fn)) {
            sdoc.addField(fn, doc.get(fn));
          }
        }
        if (returnFields.contains("score")) {
          sdoc.addField("score", it.score());
        }
        if (returnFields.contains("rank")) {
          List<Integer> thumbs = RankUpdateListener.getThumbs(docId);
          sdoc.addField("thumbs_up", thumbs.get(0));
          sdoc.addField("thumbs_down", thumbs.get(1));
          sdoc.addField("rank", RankUpdateListener.getRank(docId));
        }
        rl.add(sdoc);
      }
      rl.setMaxScore(slice.maxScore());
      rl.setNumFound(slice.matches());
      rb.rsp.getValues().remove("response");
      rb.rsp.add("response", rl);
    }
  }

  private Set<String> getReturnFields(ResponseBuilder rb) {
    Set<String> fields = new HashSet<String>();
    String flp = rb.req.getParams().get(CommonParams.FL);
    if (StringUtils.isEmpty(flp)) {
      // called on startup with a null ResponseBuilder, so
      // we want to prevent a spurious NPE in the logs...
      return fields; 
    }
    String[] fls = StringUtils.split(flp, ",");
    IndexSchema schema = rb.req.getSchema();
    for (String fl : fls) {
      if ("*".equals(fl)) {
        Map<String,SchemaField> fm = schema.getFields();
        for (String fieldname : fm.keySet()) {
          SchemaField sf = fm.get(fieldname);
          if (sf.stored() && (! "content".equals(fieldname))) {
            fields.add(fieldname);
          }
        }
      } else if ("id".equals(fl)) {
        SchemaField usf = schema.getUniqueKeyField();
        fields.add(usf.getName());
      } else {
        fields.add(fl);
      }
    }
    return fields;
  }

  ///////////////////// SolrInfoMBean methods ///////////////////
  
  @Override
  public String getDescription() {
    return "Rank Extraction Component";
  }

  @Override
  public String getSource() {
    return "$Source$";
  }

  @Override
  public String getSourceId() {
    return "$Id$";
  }

  @Override
  public String getVersion() {
    return "$Revision$";
  }
}

Conclusion

And that's pretty much it! Just 5 new classes and a bit of configuration, and we have a deterministic, external-field based sort in Solr, with a commit-aware (and optionally timer-based) refresh strategy, which returns the additional fields in the Solr search results.

And now for the obligatory screenshots. The one on the left shows the results with rank information (ie, with &fl=*,rank only) sorted by relevance. The one on the right shows the same results, but sorted by rank descending (ie, with &fl=*,rank&sort=rank+desc).

The alternative (ie, going the ExternalFileField route) would require some coding (perhaps outside Solr) to write out the database data into a flat file and push it into Solr's index directory, followed by a commit. And the client would have to make a separate call to get the ranking components for each search result at render time. Also, I don't understand function queries that well (okay, not at all; I need to read up on them), but from our limited testing the function query approach did not appear to be too effective at sorting; it's entirely possible, though, that I am missing something that would make it work nicely.

Of course, I have a lot to learn about Solr and Lucene, so if you have solved this differently, or if you feel this could have been done differently/better, would love to hear from you.