Sunday, April 26, 2015

Implementing Quora Activity Feeds on your website


The website for Elsevier Labs (where I work) went online this week; do check it out and provide feedback on what kind of content you would like to see there. One of the components on the front page is a Twitter feed. So far I have avoided Twitter, because I thought it was a bit presumptuous to assume that people would actually care about my 140-character thoughts. So one of my first suggestions for "improvements" was to ask if we could include feeds from Quora and LinkedIn, where I am actually somewhat active.

While making the suggestion, I also checked what was available, and found this unofficial API for Quora written by Christopher Su, complete with a Heroku server that you can connect to. So, as a little proof of concept, I decided to see if I could pull out data about my own activity on Quora for the last two months. Sadly, I couldn't find a LinkedIn API for user activity; if you know of one, please let me know and I will check it out.

The code is really simple, and uses just two of the six or so services provided by the Quora API: the profile and answer activity services. Here is the (really simple) code to pull out my activity.

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
import urllib2
import yaml

# pull profile metadata (name, answer/follower/following counts);
# the response is JSON, which yaml.load happily parses as well
profile = urllib2.urlopen("http://quora-api.herokuapp.com/users/your-quora-id")
pjson = yaml.load(profile)
profile.close()

# pull the answer activity feed for the same user
answers = urllib2.urlopen("http://quora-api.herokuapp.com/users/your-quora-id/activity/answers")
ajson = yaml.load(answers)
answers.close()

now = datetime.utcnow()

print("<html>")
print("<head><title></title></head><body>")
print("<ul>")
for item in ajson["items"]:
    # strip the HTML tags from the answer summary and truncate it
    summary = BeautifulSoup(item["summary"])
    summary_text = summary.get_text()[10:100]
    pubdttm = datetime.strptime(item["published"], "%a, %d %b %Y %H:%M:%S %Z")
    # only show recent posts (up to 2 months old)
    if pubdttm <= now - timedelta(weeks=8):
        continue
    print("<li>%d/%d/%d: " % (pubdttm.month, pubdttm.day, pubdttm.year))
    print("%s (answers: %d, followers: %d, following: %d) " % (
        pjson["name"], pjson["answers"], pjson["followers"],
        pjson["following"]))
    print("answered <b>%s</b>: <a href=\"%s\">%s...</a>" % (
            item["title"], item["link"], summary_text))
    print("</li>")
print("</ul>")
print("</body></html>")

And it produces an HTML snippet that (without any styling) looks something like this:
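The generated markup itself has roughly the following shape; note that the date, name, counts, question title and link here are placeholders, not actual values from my profile:

<ul>
<li>4/20/2015: Your Name (answers: 123, followers: 45, following: 6) answered <b>Some question title?</b>: <a href="http://www.quora.com/...">first few words of the answer summary...</a></li>
...
</ul>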


So in any case, that was all I had for this week. I plan on being more active on Twitter going forward, because over the last few years I have observed a few good friends using it to share papers and articles, which seems like a good use case for it - blogs are a bit too heavyweight for that sort of thing. In fact, I have already started doing so; you can find my posts by searching for @palsujit. As part of this effort I will also, at the risk of being considered incredibly self-serving, start promoting my own blog posts on Twitter. I also hope to write meatier blog posts here once things have stabilized a bit and the learning curve at the job is not as steep as it is now.

Saturday, April 18, 2015

Scoring token distance in Lucene sloppy queries


Lucene (and Lucene based search platforms such as Solr and ElasticSearch) has the concept of slop for inexact phrase queries, which specifies how loose the phrase matching can be. For example, "four seven" would not match a document containing the Gettysburg Address, but "four seven"~2 or "seven four"~4 would. Here the slop is the total number of positions the query tokens must be moved for the query phrase to match the target text.
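To make the arithmetic concrete, here is a rough sketch (not Lucene's actual implementation) of how the required slop can be computed for a two-term phrase with no repeated terms: each matched term's position in the document is adjusted by its offset in the query, and the spread of the adjusted positions is the minimum slop that matches.

def required_slop(query, text):
    query_terms = query.lower().split()
    doc_terms = text.lower().split()
    # adjust each term's document position by its offset in the query
    adjusted = [doc_terms.index(term) - offset
                for offset, term in enumerate(query_terms)]
    # the spread of the adjusted positions is the minimum slop needed
    return max(adjusted) - min(adjusted)

print(required_slop("four seven", "four score and seven years ago"))  # 2
print(required_slop("seven four", "four score and seven years ago"))  # 4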

This was all I knew about slop for a long time, and I never had a need to know beyond this. Recently, however, I was working with some people who had a background in other search engines, and they raised the question: given two documents like these:

  • Four of five fathers don't know...
  • Four score and seven years ago, our fathers brought forth on this continent...

and a query "four fathers"~20, which one of these would score higher and by how much? My understanding was that slop is merely a filtering mechanism and both results would show up in no particular order. Turns out this is not quite true - while slop does serve to filter the results, the distance between the tokens does influence the final scores. I would have chalked it up to ignorance on my part and moved on, but a quick poll among some of my Lucene programmer colleagues revealed that this is something most people hadn't really thought about, so I thought it might be useful to write about it. Hence this post.

In order to investigate the variation of score with token distance, I set up the following simple experiment. I built a set of "documents", each with exactly 42 one-character "words": the letters a-j repeated four times, plus two marker words x and y. The word x was placed near the beginning of every document, and y was placed one token position further away from it in each successive document. So my "documents" looked something like this:

a x y b c d ...
a x b y c d ...
a x b c y d ...
...

To build this corpus, I wrote a little Python script that generates an array of JSON documents like so:

import string

# the first 10 lowercase letters: "abcdefghij"
first_10 = string.lowercase[0:10]
# 41-word template; inserting 'y' below brings each document to 42 words
doc_tpl = first_10[0:1] + 'x' + first_10[1:] + first_10 * 3

# emit a JSON array of documents, moving 'y' one position further from
# 'x' in each successive one; joining with commas avoids the trailing
# comma that would make the JSON invalid
docs = []
for i in range(len(doc_tpl) - 2):
    doc = doc_tpl[0:i+2] + 'y' + doc_tpl[i+2:]
    doc_title = " ".join([c for c in doc])
    docs.append("{\"id\":\"%s\", \"title\":\"%s\"}" % (i, doc_title))
print("[" + ",\n".join(docs) + "]")

This is sent to Solr's update handler using a curl command like so:

curl "http://localhost:8983/solr/update/json?commit=true" \
    -H "Content-type:application/json" --data-binary @test_data.json

I then send a single sloppy query "x y"~20 to Solr and graph the resulting scores, using SciPy's curve_fit function to fit the points to a formula I derived by looking at the explain output and the Lucene code (described after the graph).

from __future__ import division
import urllib2
import urllib
import json
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit

# the shape of the curve we fit the scores to: a + b / sqrt(x + 1)
def proximity_score(x, a, b):
    return a + b / np.power(x + 1, 0.5)

url = "http://localhost:8983/solr/collection1/select"

# first find the total number of documents in the index
urlparams = {
    "q": "*:*",
    "wt": "json",
    "rows": "0",
    "fl": "id,score"
}
conn = urllib2.urlopen(url + "?" + urllib.urlencode(urlparams))
response = json.load(conn)
count = response["response"]["numFound"]
conn.close()

# then run the sloppy query, asking for all the documents back
urlparams["q"] = "\"x y\"~20"
urlparams["rows"] = str(count)
conn = urllib2.urlopen(url + "?" + urllib.urlencode(urlparams))
response = json.load(conn)
xs = []
ys = []
for doc in response["response"]["docs"]:
    # the document id doubles as the token distance between x and y
    xs.append(int(doc["id"]))
    ys.append(float(doc["score"]))
conn.close()

# plot the actual scores
axs = np.array(xs)
ays = np.array(ys)
plt.plot(axs, ays, 'bo', label="Actual")

# fit the scores to the curve and plot the fitted values
popt, pcov = curve_fit(proximity_score, axs, ays)
print(popt)
plt.plot(axs, proximity_score(axs, *popt), 'r--', linewidth=1.5, label="Fitted")

plt.xlabel("Distance between tokens")
plt.ylabel("Score")
plt.legend()
plt.show()

This gives us the chart shown below. The blue points are the actual scores returned for each record with the specified token distance between x and y, and the red dotted line is the curve given by:

score = 9.96e-10 + (0.25 / math.sqrt(dist + 1))

Here the intercept term is almost zero, and the 0.25 in the numerator is an artifact of the other parameters in the experiment (the idf and fieldNorm, which are constant across all the documents). The takeaway is that, all other things being constant, scores for sloppy queries vary as the inverse of the square root of the between-token distance.
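As a quick sanity check, here is a small snippet that plugs the first few token distances into the fitted formula; the "actual" values alongside are the top three scores from the explain output shown further down:

import math

# fitted parameters from curve_fit: score = a + b / sqrt(dist + 1)
a, b = 9.96e-10, 0.25

for dist, actual in [(0, 0.24367055), (1, 0.17230110), (2, 0.14068326)]:
    fitted = a + b / math.sqrt(dist + 1)
    print("dist=%d actual=%.8f fitted=%.8f" % (dist, actual, fitted))

The fitted values (0.25, 0.1768, 0.1443) track the actual scores closely but not exactly, since curve_fit minimizes the overall error across all 39 points rather than passing through any particular one.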


I had initially tried to fit an exponential curve, but it didn't match as well as this one. So I started looking at the explain output for clues. Here is the explain output for the top 3 results, formatted for readability.

0: 0.24367055 = 
   (MATCH) weight(text:"x y"~20 in 0) [DefaultSimilarity], result of:
     0.24367055 = fieldWeight in 0, product of:
       1.0 = tf(freq=1.0), with freq of:
         1.0 = phraseFreq=1.0
       1.9493644 = idf(), sum of:      
         0.9746822 = idf(docFreq=39, maxDocs=39)
         0.9746822 = idf(docFreq=39, maxDocs=39)
        0.125 = fieldNorm(doc=0)
1: 0.1723011 = 
   (MATCH) weight(text:"x y"~20 in 1) [DefaultSimilarity], result of:
     0.1723011 = fieldWeight in 1, product of:
       0.70710677 = tf(freq=0.5), with freq of:
         0.5 = phraseFreq=0.5    
       1.9493644 = idf(), sum of:
         0.9746822 = idf(docFreq=39, maxDocs=39)
         0.9746822 = idf(docFreq=39, maxDocs=39)
        0.125 = fieldNorm(doc=1)
2: 0.14068326 = 
   (MATCH) weight(text:"x y"~20 in 2) [DefaultSimilarity], result of:
     0.14068326 = fieldWeight in 2, product of:
       0.57735026 = tf(freq=0.33333334), with freq of:
         0.33333334 = phraseFreq=0.33333334
       1.9493644 = idf(), sum of:
         0.9746822 = idf(docFreq=39, maxDocs=39)
         0.9746822 = idf(docFreq=39, maxDocs=39)
        0.125 = fieldNorm(doc=2)
...

On inspecting the output, I realized that the variation was happening in the phraseFreq line, falling off as (1, 0.5, 0.33, ...), i.e., as the inverse of the distance. Looking at the code for DefaultSimilarity, the phraseFreq for a sloppy match is 1 / (distance + 1), and it contributes its square root to the score via the tf(phraseFreq) call in the line immediately above.
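To convince myself, I recomputed the top three scores from the constants in the explain output. The snippet below assumes DefaultSimilarity's sloppyFreq(distance) = 1 / (distance + 1) and tf(freq) = sqrt(freq), with the idf and fieldNorm values taken verbatim from the explain above:

import math

idf = 1.9493644      # sum of the two per-term idf values in the explain
field_norm = 0.125   # fieldNorm for the 42-word field

for distance in range(3):
    # DefaultSimilarity.sloppyFreq(distance) = 1 / (distance + 1)
    phrase_freq = 1.0 / (distance + 1)
    # fieldWeight = tf(phraseFreq) * idf * fieldNorm, with tf(freq) = sqrt(freq)
    score = math.sqrt(phrase_freq) * idf * field_norm
    print("distance=%d phraseFreq=%.8f score=%.8f" % (distance, phrase_freq, score))

This reproduces the 0.24367055, 0.17230110 and 0.14068326 scores above, and substituting phraseFreq = 1 / (distance + 1) into tf() is exactly what gives the 1 / sqrt(distance + 1) falloff seen in the fitted curve.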

This is all I have for this week. Hopefully I was able to share something you didn't know through this experiment.