Saturday, February 26, 2011

Solr: a custom Search RequestHandler

As you know, I've been playing with Solr lately, trying to see how feasible it would be to customize it for our needs. We have been a Lucene shop for a while, and we've built our own search framework around it, which has served us well so far. The move to Solr is driven primarily by the need to expose our search tier as a service to our internal applications. While it would have been relatively simple (probably simpler) to slap an HTTP interface on top of our current search tier, we also want to use other Solr features such as incremental indexing and replication.

One of our challenges to using Solr is that the way we do search is quite different from the way Solr does search. A query string passed to the default Solr search handler is parsed into a Lucene query and a single search call is made on the underlying index. In our case, the query string is passed to our taxonomy, and depending on the type of query (as identified by the taxonomy), it is sent through one or more sub-handlers. Each sub-handler converts the query into a (different) Lucene query and executes the search against the underlying index. The results from each sub-handler are then layered together to present the final search result.

Conceptually, the customization is quite simple: create a custom subclass of RequestHandlerBase (as advised on this wiki page) and override its handleRequestBody(SolrQueryRequest, SolrQueryResponse) method. In reality, I had quite a tough time doing this, admittedly caused (at least in part) by my ignorance of Solr internals. However, I did succeed, so in this post I outline my solution, along with some advice I feel would be useful to others embarking on a similar route.

Configuration and Code

The handler is configured to trigger in response to a /solr/mysearch request. Here is the (rewritten for readability) XML snippet from my solrconfig.xml file. I used the "invariants" block to pass in configuration parameters for the handler.

  ...
  <requestHandler name="/mysearch" 
      class="org.apache.solr.handler.ext.MyRequestHandler">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="fl">*,score</str>
      <str name="wt">xml</str>
    </lst>
    <lst name="invariants">
      <str name="prop1">value1</str>
      <int name="prop2">value2</int>
      <!-- ... more config items here ... -->
    </lst>
  </requestHandler>
  ...

And here is the (also rewritten for readability) code for the custom handler. I used the SearchHandler and MoreLikeThisHandler as my templates, but diverged from them in several ways to accommodate my requirements. I describe the differences below.

package org.apache.solr.handler.ext;

// imports omitted

public class MyRequestHandler extends RequestHandlerBase {

  private String prop1;
  private int prop2;
  ...
  private TaxoService taxoService;

  @Override
  public void init(NamedList args) {
    super.init(args);
    this.prop1 = invariants.get("prop1");
    this.prop2 = Integer.valueOf(invariants.get("prop2"));
    ...
    this.taxoService = new TaxoService(prop1);
  }

  @Override
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
      throws Exception {

    // extract params from request
    SolrParams params = req.getParams();
    String q = params.get(CommonParams.Q);
    String[] fqs = params.getParams(CommonParams.FQ);
    int start = 0;
    try { start = Integer.parseInt(params.get(CommonParams.START)); } 
    catch (Exception e) { /* default */ }
    int rows = 0;
    try { rows = Integer.parseInt(params.get(CommonParams.ROWS)); } 
    catch (Exception e) { /* default */ }
    SolrPluginUtils.setReturnFields(req, rsp);

    // build initial data structures
    TaxoResult taxoResult = taxoService.getResult(q);
    SolrDocumentList results = new SolrDocumentList();
    SolrIndexSearcher searcher = req.getSearcher();
    Map<String,SchemaField> fields = req.getSchema().getFields();
    int ndocs = start + rows;
    Filter filter = buildFilter(fqs, req);
    Set<Integer> alreadyFound = new HashSet<Integer>();

    // invoke the various sub-handlers in turn and return results
    doSearch1(results, searcher, q, filter, taxoResult, ndocs, req, 
      fields, alreadyFound);
    doSearch2(results, searcher, q, filter, taxoResult, ndocs, req, 
      fields, alreadyFound);
    // ... more sub-handler calls here ...

    // build and write response
    float maxScore = 0.0F;
    int numFound = 0;
    List<SolrDocument> slice = new ArrayList<SolrDocument>();
    for (Iterator<SolrDocument> it = results.iterator(); it.hasNext(); ) {
      SolrDocument sdoc = it.next();
      Float score = (Float) sdoc.getFieldValue("score");
      if (maxScore < score) {
        maxScore = score;
      }
      if (numFound >= start && numFound < start + rows) {
        slice.add(sdoc);
      }
      numFound++;
    }
    results.clear();
    results.addAll(slice);
    results.setNumFound(numFound);
    results.setMaxScore(maxScore);
    results.setStart(start);
    rsp.add("response", results);

  }

  private Filter buildFilter(String[] fqs, SolrQueryRequest req) 
      throws IOException, ParseException {
    if (fqs != null && fqs.length > 0) {
      BooleanQuery fquery = new BooleanQuery();
      for (int i = 0; i < fqs.length; i++) {
        QParser parser = QParser.getParser(fqs[i], null, req);
        fquery.add(parser.getQuery(), Occur.MUST);
      }
      return new CachingWrapperFilter(new QueryWrapperFilter(fquery));
    }
    return null;
  }

  private void doSearch1(SolrDocumentList results,
      SolrIndexSearcher searcher, String q, Filter filter, 
      TaxoResult taxoResult, int ndocs, SolrQueryRequest req,
      Map<String,SchemaField> fields, Set<Integer> alreadyFound) 
      throws IOException {
    // check entry condition
    if (! canEnterSearch1(q, filter, taxoResult)) {
      return;
    }
    // build custom query and extra fields
    Query query = buildCustomQuery1(q, taxoResult);
    Map<String,Object> extraFields = new HashMap<String,Object>();
    extraFields.put("search_type", "search1");
    boolean includeScore = 
      req.getParams().get(CommonParams.FL).contains("score");
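    // maxDocsPerSearcherType and maprelScoreCutoff are additional config
    // fields (declarations elided above, along with the other properties)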
    append(results, searcher.search(
      query, filter, maxDocsPerSearcherType).scoreDocs,
      alreadyFound, fields, extraFields, maprelScoreCutoff, 
      searcher.getReader(), includeScore);
  }

  // ... more doSearchXXX() calls here ...

  private void append(SolrDocumentList results, ScoreDoc[] more, 
      Set<Integer> alreadyFound, Map<String,SchemaField> fields,
      Map<String,Object> extraFields, float scoreCutoff, 
      SolrIndexReader reader, boolean includeScore) throws IOException {
    for (ScoreDoc hit : more) {
      if (alreadyFound.contains(hit.doc)) {
        continue;
      }
      Document doc = reader.document(hit.doc);
      SolrDocument sdoc = new SolrDocument();
      for (String fieldname : fields.keySet()) {
        SchemaField sf = fields.get(fieldname);
        if (sf.stored()) {
          sdoc.addField(fieldname, doc.get(fieldname));
        }
      }
      for (String extraField : extraFields.keySet()) {
        sdoc.addField(extraField, extraFields.get(extraField));
      }
      if (includeScore) {
        sdoc.addField("score", hit.score);
      }
      results.add(sdoc);
      alreadyFound.add(hit.doc);
    }
  }
  
  //////////////////////// SolrInfoMBeans methods //////////////////////

  @Override
  public String getDescription() {
    return "My Search Handler";
  }

  @Override
  public String getSource() {
    return "$Source$";
  }

  @Override
  public String getSourceId() {
    return "$Id$";
  }

  @Override
  public String getVersion() {
    return "$Revision$";
  }
}

Configuration Parameters - I started out baking most of my "configuration" parameters in as constants within the handler code, but later moved them into the invariants block in the XML declaration. This is not ideal, since we still need to touch the solrconfig.xml file (which is regarded as application code in our environment) to change behavior. The ideal solution, given the circumstances, would probably be to hold the configuration parameters in JNDI and have the handler connect to JNDI to pull the properties it needs.
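
If we do go down the JNDI route, the handler's init() method could look up its properties from the container's JNDI context instead of the invariants block. Here is a rough sketch of what that might look like - note that the env entry names ("mysearch/prop1" etc.) and the java:comp/env context are assumptions about how the container would be set up, not something we have in place today.

  // Sketch only: read handler configuration from JNDI instead of the
  // invariants block. Requires javax.naming.Context, InitialContext and
  // NamingException imports; the env entry names below are hypothetical.
  @Override
  public void init(NamedList args) {
    super.init(args);
    try {
      Context env = (Context) new InitialContext().lookup("java:comp/env");
      this.prop1 = (String) env.lookup("mysearch/prop1");
      this.prop2 = (Integer) env.lookup("mysearch/prop2");
      this.taxoService = new TaxoService(prop1);
    } catch (NamingException e) {
      throw new RuntimeException("Unable to read handler configuration", e);
    }
  }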

Using Filter - The MoreLikeThis handler converts the fq (filter query) parameters into a List of Query objects, because that is what searcher.getDocList() expects. In my case, I couldn't use DocListAndSet because DocList is unmodifiable (i.e., DocList.add() throws an UnsupportedOperationException). So I fell back to the pattern I am used to, which is getting the ScoreDoc[] array from a standard searcher.search(Query, Filter, numDocs) call. That is why buildFilter() above returns a Filter and not a List<Query>.

Connect to external services - My handler needs to connect to the taxonomy service. Our taxonomy exposes an RMI service with a very rich and fine-grained API. I tried to use this at first, but ran into problems because the client needs access to configuration files on the local filesystem, and Jetty couldn't see these files because they were outside its context. I ended up solving this by exposing a coarse-grained JSON service over HTTP on the taxonomy side. The handler calls it once per query and gets back all the information it needs in a single call. Probably not ideal, since the logic is now spread across two places - I will probably revisit the RMI client integration in the future.
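
To give an idea of the shape of this integration, here is a rough sketch of what the TaxoService client could look like. The endpoint URL, the query parameter name and the TaxoResult.fromJson() parsing step are all made up for illustration; the real JSON handling depends on what the taxonomy service returns.

// Rough sketch of a coarse-grained HTTP/JSON client for the taxonomy service.
// The URL layout, query parameter and TaxoResult.fromJson() are hypothetical.
package org.apache.solr.handler.ext;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;

public class TaxoService {

  private final String baseUrl;   // e.g. "http://taxo-host:8080/taxo/lookup"

  public TaxoService(String baseUrl) {
    this.baseUrl = baseUrl;
  }

  public TaxoResult getResult(String q) throws IOException {
    URL url = new URL(baseUrl + "?q=" + URLEncoder.encode(q, "UTF-8"));
    URLConnection conn = url.openConnection();
    StringBuilder json = new StringBuilder();
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        json.append(line);
      }
    } finally {
      reader.close();
    }
    // placeholder for whatever JSON parsing the taxonomy response needs
    return TaxoResult.fromJson(json.toString());
  }
}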

Layer multiple resultsets - This is the main reason for writing the custom handler. Most of the work happens in the append() method above. Each sub-handler calls SolrIndexSearcher.search(Query, Filter, numDocs) and copies its resulting ScoreDoc[] array into the shared SolrDocumentList. Since a previous sub-handler may already have returned a given document, each sub-handler checks candidates against a Set of already-found docIds.

Add a pseudo-field to the Document - There are currently two competing initiatives in Solr (SOLR-1566 and SOLR-1298) on how to handle this situation. Since I was populating SolrDocument objects (this was one of the reasons I started using SolrDocumentList), it was relatively simple for me to pass in a Map of extra fields which are just tacked on to the end of the SolrDocument.

Some Miscellaneous Advice

Here are some tips and advice I wish someone had given me before I started out on this.

For your own sanity, standardize on a Solr release. I chose 1.4.1, which was the latest release at the time of writing. Prior to that, I was developing against the Solr trunk. One day (after about 60-70% of my code was working), I decided to do an svn update, and all of a sudden there were a whole bunch of compile failures (in my code as well as the Solr code). Some of them were probably caused by missing or out-of-date JARs in my .classpath. But the point is that Solr is being actively developed, there is quite a bit of code churn, and if you really want to work on the trunk (or a pre-release branch), you should be ready to deal with these situations.

Solr is well designed (so the flow is fairly intuitive) and reasonably well documented, but there are places where you will probably need to step through the code in a debugger to figure out what's going on. I am still using the Jetty container in the example subdirectory. This page on Lucid Imagination outlines the steps needed to run Solr within Eclipse using the Jetty plugin, but thanks to the information on this StackOverflow page, all I did was add some command-line parameters to the java call, like so:

sujit@cyclone:example$ java -Dsolr.solr.home=my_schema \
  -agentlib:jdwp=transport=dt_socket,server=y,address=8883,suspend=n \
  -jar start.jar

and then set up a remote debug configuration for localhost:8883 in Eclipse, after which I could step through the code just fine.

Solr caches very aggressively (which is great for a production environment), but for development you need to disable the caching. I did this by commenting out the filterCache, queryResultCache and documentCache sections in solrconfig.xml, and changing the httpCaching element to use never304="true".
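
For reference, the httpCaching change amounts to something like this inside the requestDispatcher section of solrconfig.xml (the rest of the element is left as shipped):

  <requestDispatcher handleSelect="true">
    ...
    <httpCaching never304="true" />
  </requestDispatcher>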

Conclusion

The approach I described here is not as performant as the "standard" flow. Because I have to do multiple searches in a single request, I am doing more I/O. I am also consuming more CPU cycles, since I have to dedup documents across the layers, and more memory per request, because I populate the SolrDocument objects inline rather than just passing a DocListAndSet to the ResponseBuilder. I don't see a way around it, though, given the nature of my requirements.

If you are a Solr expert, or someone who is familiar with the internals, I would appreciate hearing your thoughts about this approach - criticisms and suggestions are welcome.

Saturday, February 12, 2011

Solr, Porter Stemming and Stemming Exclusions

The Porter Stemmer is somewhat of a gold standard when it comes to stemming for search applications, allowing you to match inflected words in your query against similarly inflected words in your index. For example, a search for "abnormal" would return documents containing "abnormality", "abnormalities" and "abnormally", because all these words are stemmed to "abnorm" at index time, and "abnormal" is also stemmed to "abnorm" at query time. However, as Ted Dziuba points out, when stemming works, it is very, very good, but when it doesn't, the results can be pretty horrible.

There are other stemmers available, but as a paper by Hull comparing various stemmers over a group of queries (PDF Download) shows, there is no one true stemmer that consistently outperforms the others. Most people end up using one stemmer or another with exclusion sets, or, less commonly, modifying the stemmer rules directly.

In our case, we built a custom analyzer that checks an exclusion set (supplied as a flat file of words that should not be stemmed); if the word is in the exclusion set, Porter stemming is skipped. In Solr, filters are supplied as a chain, so our current approach wouldn't carry over directly. An alternative would have been to build this functionality into a custom token filter that invokes the Porter stemmer only if the token is not found in its exclusion set (subclassing the Porter stem TokenFilter is not possible, since the TokenFilter implementations are final).
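
Just to illustrate what such a filter could look like, here is a minimal sketch. The class name is mine, and it calls the Snowball English stemmer directly (the same "English" stemmer the SnowballPorterFilterFactory uses below) since the stock stemming filters can't be subclassed; in a real Solr deployment it would also need a corresponding TokenFilterFactory.

// Sketch only: a token filter that skips stemming for words in an exclusion
// set, delegating to the Snowball English stemmer for everything else.
package org.apache.solr.analysis.ext;

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.tartarus.snowball.ext.EnglishStemmer;

public class ExclusionAwareStemFilter extends TokenFilter {

  private final Set<String> exclusions;  // words that must not be stemmed
  private final EnglishStemmer stemmer = new EnglishStemmer();
  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);

  public ExclusionAwareStemFilter(TokenStream input, Set<String> exclusions) {
    super(input);
    this.exclusions = exclusions;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    String term = termAttr.toString();
    if (!exclusions.contains(term)) {
      stemmer.setCurrent(term);
      stemmer.stem();
      termAttr.setEmpty().append(stemmer.getCurrent());
    }
    return true;
  }
}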

Solr anticipates this use case and provides the SnowballPorterFilterFactory, which allows you to supply the exclusion set via an init-arg named "protected" (pointing to a file on the CLASSPATH). It does this by inserting a KeywordMarkerFilter in front of the language-specific Snowball filter. When a word is found in the exclusion set, it is marked as a keyword (using the KeywordAttribute). The Snowball filter checks whether the incoming term is marked as a keyword, and if so, does not stem it.
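
Wiring this up in schema.xml is then just a matter of pointing the factory at the exclusions file. A typical analyzer chain would look something like this (the field type name and the protected-words file name are illustrative):

  <fieldType name="text_stem" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English"
              protected="protwords.txt"/>
    </analyzer>
  </fieldType>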

To test the functionality, I wrote a little JUnit class (I know, Solr has built-in regression testing, so it doesn't make much sense for me to test this). But my original intent was to compare the stemming of our custom analyzer against the analyzer I was trying to set up as my default text analyzer in Solr (using off-the-shelf Solr components as far as possible), so this test was already part of that effort, and I figured it couldn't hurt to check it out for myself. Here is the relevant snippet of the test case.

// Source: src/test/org/apache/solr/analysis/ext/PorterStemmerWithExclusionsTest.java
package org.apache.solr.analysis.ext;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.FilenameFilter;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.lang.math.NumberUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
import org.apache.lucene.util.Version;
import org.apache.solr.analysis.SnowballPorterFilterFactory;
import org.apache.solr.common.ResourceLoader;
import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;
import org.junit.Test;

public class PorterStemmerWithExclusionsTest {

  @Test
  public void testPorterExclusion() throws Exception {
   String[] inputTerms = new String[] {
     "mariner", "marin", "marketing", "market"
   };
   // without stemming exclusions
   System.out.println("==== without stemming exclusions =====");
   List<String> protectedWords = new ArrayList<String>();
   Analyzer analyzer0 = getAnalyzer(protectedWords);
   for (String inputTerm : inputTerms) {
     TokenStream input = analyzer0.tokenStream(
      "f", new StringReader(inputTerm));
     while (input.incrementToken()) {
       CharTermAttribute termAttribute = 
         input.getAttribute(CharTermAttribute.class);
       String outputTerm = termAttribute.toString();
       boolean isKeyword = 
         input.getAttribute(KeywordAttribute.class).isKeyword();
       System.out.println(inputTerm + "(keyword=" + isKeyword + ") => " + 
         outputTerm);
     }
   }
   // with stemming exclusions
   System.out.println("==== with stemming exclusions =====");
   protectedWords.add("marketing");
   protectedWords.add("mariner");
   Analyzer analyzer1 = getAnalyzer(protectedWords);
   for (String inputTerm : inputTerms) {
     TokenStream input = analyzer1.tokenStream(
       "f", new StringReader(inputTerm));
     while (input.incrementToken()) {
       CharTermAttribute termAttribute = 
         input.getAttribute(CharTermAttribute.class);
       String outputTerm = termAttribute.toString();
       boolean isKeyword = 
         input.getAttribute(KeywordAttribute.class).isKeyword();
       System.out.println(inputTerm + "(keyword=" + isKeyword + ") => " + 
         outputTerm);
     }
   }
  }

  private Analyzer getAnalyzer(final List<String> protectedWords) {
    return new Analyzer() {
      @Override 
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream input = new WhitespaceTokenizer(Version.LUCENE_40, reader);
        input = new LowerCaseFilter(Version.LUCENE_40, input);
        SnowballPorterFilterFactory factory = new SnowballPorterFilterFactory();
        Map<String,String> args = new HashMap<String,String>();
        args.put("luceneMatchVersion", Version.LUCENE_40.name());
        args.put("language", "English");
        if (! protectedWords.isEmpty()) {
          args.put("protected", "not-a-null.txt");
        }
        factory.init(args);
        factory.inform(new LinesMockSolrResourceLoader(protectedWords));
        return factory.create(input);
      }
    };
  }

  private class LinesMockSolrResourceLoader implements ResourceLoader {
    List<String> lines;
    
    public LinesMockSolrResourceLoader(List<String> lines) {
      this.lines = lines;
    }

    @Override
    public List<String> getLines(String resource) throws IOException {
      return lines;
    }

    @Override
    public Object newInstance(String cname, String... subpackages) {
      return null;
    }

    @Override
    public InputStream openResource(String resource) throws IOException {
      return null;
    }
  }
}

The results of the test are shown below. As you can see, by default the Porter stemmer stems both "mariner" and "marin" down to "marin", and both "marketing" and "market" down to "market". Once the exclusions are added, the results are in line with user expectations. These examples are taken from Ted Dziuba's post (referenced earlier); read it for the background if you haven't already.

    [junit] ==== without stemming exclusions =====
    [junit] mariner(keyword=false) => marin
    [junit] marin(keyword=false) => marin
    [junit] marketing(keyword=false) => market
    [junit] market(keyword=false) => market
    [junit] ==== with stemming exclusions =====
    [junit] mariner(keyword=true) => mariner
    [junit] marin(keyword=false) => marin
    [junit] marketing(keyword=true) => marketing
    [junit] market(keyword=false) => market

So Solr provides the functionality needed to override your stemming algorithm with a list of exclusions. Of course, you still need to figure out which words to put in the exclusion set. One approach is to start with an empty set and add words to it by scanning and stemming queries from your search logs. You could also start by scanning your document set (or a representative sample of it) to find words that are mis-stemmed, add them to the exclusion set, and then scan search logs periodically to find new occurrences.

I describe the second approach below. The documents that make up our index come to us in various formats - HTML (for crawled content), XML (from our content providers) and JSON (from our CMS). Basically, we need to extract the text from these documents, feed it word by word to our analyzer, and collect the stemmed forms. We then create a report listing each stemmed form along with the various words that stemmed to it. Here is the code snippet (modeled as a JUnit test in the same class as the one shown above).

  ...
  @Test
  public void testFindCandidatesForExclusion() throws Exception {
    Map<String,Set<String>> stemmedTerms = new HashMap<String,Set<String>>();
    List<String> protectedWords = new ArrayList<String>();
    Analyzer analyzer = getAnalyzer(protectedWords);
    Set<String> stopSet = getStopSet();
    File[] xmls = new File("/path/to/xmlfiles").listFiles(
      new FilenameFilter() {
        @Override public boolean accept(File dir, String name) {
          return name.endsWith(".xml");
        }
      }
    );
    for (File xml : xmls) {
      System.out.println("Processing file: " + xml.getAbsolutePath());
      SAXReader saxReader = new SAXReader();
      saxReader.setValidation(false);
      Document xdoc = saxReader.read(xml);
      StringBuilder buf = new StringBuilder();
      extractTextFromElementAndChildren(xdoc.getRootElement(), buf);
      // break up the input by whitespace and punctuation
      String[] words = buf.toString().split("[\\p{Punct}|\\p{Space}]");
      for (String word : words) {
        if (NumberUtils.isNumber(word) || StringUtils.isEmpty(word)) {
          continue;
        }
        word = word.replaceAll("\"", "");
        word = word.replaceAll("[^\\p{ASCII}]", "");
        word = StringUtils.lowerCase(word);
        if (stopSet.contains(word)) {
          continue;
        }
        TokenStream input = analyzer.tokenStream("f", new StringReader(word));
        while (input.incrementToken()) {
          CharTermAttribute termAttribute = 
            input.getAttribute(CharTermAttribute.class);
          String stemmed = termAttribute.toString();
          Set<String> originalWords = stemmedTerms.containsKey(stemmed) ?
            stemmedTerms.get(stemmed) : new HashSet<String>();
          originalWords.add(word);
          stemmedTerms.put(stemmed, originalWords);
        }
      }
    }
    // write this out
    PrintWriter writer = new PrintWriter(new FileWriter(
      new File("/tmp/stem-results.txt")));
    List<String> stemmedKeys = new ArrayList<String>();
    stemmedKeys.addAll(stemmedTerms.keySet());
    Collections.sort(stemmedKeys);
    for (String stemmedKey : stemmedKeys) {
      Set<String> originalWords = stemmedTerms.get(stemmedKey);
      if (originalWords.size() > 1) {
        writer.println(stemmedKey + " => " + 
          StringUtils.join(stemmedTerms.get(stemmedKey).iterator(), ", "));
      }
    }
    writer.flush();
    writer.close();
  }
  
  private void extractTextFromElementAndChildren(
      Element parent, StringBuilder buf) {
    String text = parent.getTextTrim();
    if (text.length() > 0) {
      buf.append(text).append(text.endsWith(".") ? " " : ". ");
    }
    List<Element> children = parent.elements();
    for (Element child : children) {
      extractTextFromElementAndChildren(child, buf);
    }
  }

  private Set<String> getStopSet() {
    Set<String> stopset = new HashSet<String>();
    try {
      BufferedReader reader = new BufferedReader(new FileReader(
        new File("/path/to/stopwords.txt")));
      String line;
      while ((line = reader.readLine()) != null) {
        if (StringUtils.isEmpty(line) || line.startsWith("#")) {
          continue;
        }
        stopset.add(line);
      }
      reader.close();
      return stopset;
    } catch (Exception e) {
      return stopset;
    }
  }
  ...

After the run, I manually went through the report and picked out the ones I thought were mis-stemmed for my context. They are shown below - there are only 16 of them out of the 6000+ stems generated from this corpus (of around 100 documents), so the Porter stemmer did the right thing about 99.7% of the time, which is quite impressive. Some of these are context-dependent; for example, "race" and "racing" mean different things in a health context, but probably not in a sports context.

aerob => aerobics, aerobic
aid => aids, aiding, aid, aided, aides
angl => angles, angling, angled, angle
anim => anim, animals, animation, animal
arm => arms, arm, armed
bitter => bittering, bitterly, bitter
coupl => couplings, coupling, coupled, couple, couples
dead => deadly, dead
depress => depressants, depress, depresses, depressed, depressions, depressing, depressant, depressive, depression
easter => easter, easterly
head => headings, headed, heads, head, heading
mortal => mortalities, mortal, mortality
physic => physical, physics, physically
plagu => plague, plagued
plumb => plumb, plumbing
race => racing, races, race

The process of selecting the misstemmed terms is manual and quite painful (somewhat like looking for a needle in a haystack), but I think the report can be whittled down somewhat by calculating the similarity between the meanings of the original words - for example, "depressed" and "depression" are probably close enough so we wouldn't care about them if they were the only words stemmed to "depress". I haven't tried that yet, but this approach seems feasible based on this paper describing Wordnet::Similarity by Pedersen, Patwardhan and Michelizzi (PDF Download). I will report my findings on this in a future post.