Saturday, February 28, 2015

Modeling the Evolution of Skin Color


Sometime back, someone asked this question on Quora: if all humans descended from a single population of ancestors in Africa, how did different skin colors come about? Pippi Maria Groving provided an answer (the first one) that I found intuitively very appealing, although she does mention (and later answers also indicate) that this may be a bit of an oversimplification.

Essentially she states that there are 3 important genes that govern skin color in humans, say A, B and C. Each human has a pair of each, and each pair may be a combination of dominant and recessive versions of these genes, denoted by uppercase and lowercase letters respectively. Thus, under this model, there are 6 possible alleles for skin color: A, a, B, b, C and c. The dominant versions produce pigments that darken the skin, so the more dominant alleles one has, the darker the skin color. She also provides an 8x8 Punnett square exhaustively listing all combinations of genes (genotypes) and their resulting skin colors (phenotypes). Binning the phenotypes into 7 distinct skin colors yields a theoretical ratio of 1:6:15:20:15:6:1 from very dark to very light.
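This theoretical ratio is just the binomial coefficients: with 6 allele slots, each independently dominant or recessive with probability 0.5, the chance of ending up with exactly k dominant alleles is C(6,k)/64. A quick standalone check (the n_choose_k helper below is only for this check and is not part of the simulation code):

# The 1:6:15:20:15:6:1 ratio is C(6, k) for k = 0..6, i.e. the number of
# ways to choose which k of the 6 allele slots are dominant.
from math import factorial

def n_choose_k(n, k):
    return factorial(n) // (factorial(k) * factorial(n - k))

print([n_choose_k(6, k) for k in range(7)])
# prints [1, 6, 15, 20, 15, 6, 1]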

I am also currently doing an edX course (offered by MIT) on Quantitative Biology, so I figured it would be interesting to model this as a simulation and see whether my experimental results matched the theoretical ratios. Since skin color is an adaptation to climate - darker skin provides protection from ultraviolet (UV) light in hot, sunny climates, while lighter skin can make more vitamin D with the limited UV of colder, less sunny climates - I also ran a second simulation to model that situation. This post is the result of that work.

The functions used for this simulation are provided below. I describe the relevant functions as they are called from the main code further down.

from operator import itemgetter
from random import random
import math
import matplotlib.pyplot as plt
import nltk
import numpy as np

def person():
    alleles = []
    for allele in ['a','b','c']:
        pairs = []
        for pair in range(2):
            pairs.append(allele if random() <= 0.5 else allele.upper())
        alleles.append("".join(sorted(pairs)))
    return alleles

def shuffle_and_choose(counts):
    shuffled = [x[0] for x in sorted(enumerate([random() for i in 
                range(len(counts))]), key=itemgetter(1))]
    return counts[shuffled[0]]

def compute_mating_likelihood(left, right):
    left_dominant = get_num_dominant(left)
    right_dominant = get_num_dominant(right)
    diff = abs(left_dominant - right_dominant)
    return math.exp(-diff)

def mate(left, right):
    mated_alleles = []
    for i in range(3):
        child_pairs = []
        for lp in left[i]:
            for rp in right[i]:
                child_pairs.append("".join(sorted([lp, rp])))
        mated_alleles.append(shuffle_and_choose(child_pairs))
    return mated_alleles

def get_num_dominant(allele):
    return len([c for c in "".join(allele) if c == c.upper()])    
        
def produce_next_generation(curr_gen, region_filter=None):
    next_gen = []
    males = curr_gen[:len(curr_gen) // 2]
    females = curr_gen[len(curr_gen) // 2:]
    i = 0
    while i < len(curr_gen):
        mptr = int(random() * len(males))
        fptr = int(random() * len(females))
        offspring = mate(males[mptr], females[fptr])
        if region_filter is not None:
            num_dominant = get_num_dominant(offspring)
            if num_dominant not in region_filter:
                if random() > 0.1:
                    continue
        next_gen.append(offspring)
        i = i + 1
    return next_gen
        
SKIN_COLORS = {
  6: 0x111111, 5: 0x6B0000, 4: 0x7B3812, 3: 0xAB671D, 
  2: 0xE0AD87, 1: 0xFDDACA, 0: 0xFEF2DF 
}

def get_color_distrib(curr_gen):
    color_dist = {k:0 for k in SKIN_COLORS.keys()}
    for alleles in curr_gen:
        num_dominant = get_num_dominant(alleles)
        color_dist[num_dominant] = color_dist[num_dominant] + 1
    dist_values = []
    for k in sorted(list(color_dist.keys())):
        dist_values.append(color_dist[k])
    return np.array(dist_values)

def plot_population_chart(color_pop, gen_title):
    xs = [str(hex(SKIN_COLORS[x])).replace("0x", "#") for x in range(7)]
    plt.bar(range(len(xs)), color_pop, color=xs)
    plt.xlabel("Skin Colors")
    plt.ylabel("Frequency")
    plt.xticks([])
    plt.title("Skin Color Distribution: %s" % (gen_title))
    plt.show()

def plot_population_drift(drift_data, title):
    generations = range(drift_data.shape[1])
    xs = range(drift_data.shape[0])
    colors = [str(hex(SKIN_COLORS[x])).replace("0x", "#") for x in xs]
    plt.stackplot(generations, drift_data, baseline="zero", colors=colors)
    plt.xlabel("Generations")
    plt.ylabel("Frequency")
    plt.title("Phenotype Drift:%s" % (title))
    plt.show()

The first part models a situation where the distribution of skin color is truly random. I generate a population of 2,000 individuals, each with a random set of 3 gene pairs (alleles). I then split this population in half, giving two sets of 1,000 individuals each. A random member of one set is mated with a random member of the other to produce an offspring - to ensure randomness in the offspring, I exhaustively compute all possible pairings for each allele and randomly pick one to form the corresponding offspring allele. This is repeated for 100 generations, and the resulting population distribution is binned into the 7 skin color phenotypes.

num_generations = 100

drift_data = np.zeros((len(SKIN_COLORS), num_generations + 1))
curr_gen = [person() for i in range(2000)]
drift_data[:, 0] = get_color_distrib(curr_gen)
plot_population_chart(drift_data[:, 0], "Initial")

for i in range(num_generations):
    next_gen = produce_next_generation(curr_gen)
    drift_data[:, i+1] = get_color_distrib(next_gen)
    curr_gen = next_gen
    
plot_population_chart(drift_data[:, num_generations], "Final")    

print(drift_data[:, num_generations])

As can be seen, the distribution of the skin color phenotype in the 100th generation looks remarkably similar to that of the first generation. The colors are identical to those used in Pippi's answer (thanks to this online Color Picker Tool).






The chart below shows the drift in the phenotype distribution across generations. Once again, the long-term trend seems quite flat and unchanging.


The counts of individuals in the different skin color categories in the final generation were, for me, 39:182:467:637:473:169:33, which is remarkably similar to the theoretical ratio, as shown below. Note: some numbers were rounded to make them line up for easier visual comparison.

>>> from __future__ import division
>>> import numpy as np
>>> theoretical = np.array([1, 6, 15, 20, 15, 6, 1])
>>> observed = np.array([39, 182, 467, 637, 473, 169, 33])
>>> theoretical / np.sum(theoretical)
array([ 0.0156, 0.094, 0.2343, 0.3125, 0.2344, 0.0938, 0.0156])
>>> observed / np.sum(observed)
array([ 0.0195, 0.091, 0.2335, 0.3185, 0.2365, 0.0845, 0.0165])

I then tried to simulate the situation where identically random sets of people moved to different geographical regions (5 in my case) with different levels of sunlight. In each region, natural selection would ensure that people of certain skin colors survived. I chose skin color "trigrams" for each region: region 1 (cold and dark) gets the three lightest skin color categories, region 2 the next three, and so on, up to region 5 (hot and sunny), which gets the three darkest. Offspring produced in each generation whose genotypes resolved to one of the "approved" skin colors for their region would survive unconditionally, while the others would survive with only a 10% chance.

regions = [x for x in nltk.trigrams(range(7))]
for i in range(len(regions)):
    drift_data = np.zeros((len(SKIN_COLORS), num_generations + 1))
    curr_gen = [person() for x in range(2000)]
    for j in range(num_generations):
        next_gen = produce_next_generation(curr_gen, region_filter=set(regions[i]))
        drift_data[:, j+1] = get_color_distrib(next_gen)
        curr_gen = next_gen
    plot_population_drift(drift_data, "Dispersion, Region %d" % (i+1))

This produces the following population drift charts.












As can be seen, each region seems to have a preferred skin color that begins to dominate after a while. So the model, grossly oversimplified as it is, seems to agree with the facts.

I started on this because I was curious whether I could build something that approximated reality using randomness (i.e., random.random(), essentially flipping a coin). I had lots of fun with it; I hope you enjoyed reading about it too. The code for this is available in my GitHub project here.

Wednesday, February 11, 2015

Classification with Positive Examples only


I recently came across a problem where I had to identify drug names in text. The drug names could be generic (e.g., acetaminophen, aspirin) or brand names (e.g., Tylenol, Prilosec). The thing about drug names is that they "look" slightly different from normal English words, so I thought I could perhaps exploit this by building character n-gram features for each word in the text and comparing them against a model built from the n-gram features of known drug names.
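To make the idea concrete, here is what such character n-gram features look like for a single word. This is only an illustrative sketch - the str_to_ngrams helper that actually does this appears later in the post, and like it, the snippet pads each word with start (S) and end (E) markers:

# Illustrative only: character 3-grams of a word, padded with start/end
# markers - a simplified version of the str_to_ngrams helper shown later.
import nltk

def char_trigrams(word):
    padded = "S" + word.lower() + "E"
    return ["".join(g) for g in nltk.ngrams(padded, 3)]

print(char_trigrams("acetaminophen"))
# ['Sac', 'ace', 'cet', 'eta', 'tam', 'ami', 'min', 'ino', 'nop', 'oph', 'phe', 'hen', 'enE']

The intuition is that some of these character sequences show up much more often in drug names than in ordinary English words, which is what an n-gram model should be able to pick up on.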

For a list of drug names, I used the drugbank.xml file, which contains 7,779 generic and 9,251 brand names. These are my positive examples, i.e., I can derive from this set what a drug name should look like. However, I don't have a corresponding list of negative examples. What I do have are the words in the text, which can be thought of as unlabeled examples. The book Web Data Mining by Bing Liu describes several methods for building classifiers from these inputs - this post describes how I used one of them.

In the past (before I learned of this approach), I would treat this as an unsupervised problem, comparing the n-grams of unknown words against the n-gram set I had generated from the drugbank.xml file. Words with a similarity above some (manually derived) threshold would be considered drug names, the rest would not.
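Roughly, that older approach amounted to something like the sketch below. This is not the code I actually used: drug_ngrams stands in for the set of 3-grams collected from drugbank.xml, the overlap fraction is just one possible similarity measure, and the 0.5 threshold is purely illustrative.

# Sketch of a threshold-based check: what fraction of a word's character
# 3-grams have been seen among known drug names? drug_ngrams is assumed to
# be a set of 3-grams built from drugbank.xml; 0.5 is an illustrative cutoff.
import nltk

def looks_like_drug(word, drug_ngrams, threshold=0.5):
    padded = "S" + word.lower() + "E"
    word_grams = set("".join(g) for g in nltk.ngrams(padded, 3))
    if len(word_grams) == 0:
        return False
    overlap = len(word_grams & drug_ngrams) / float(len(word_grams))
    return overlap >= threshold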

With my newly acquired knowledge, the approach is as follows. First, treat the unlabeled data as negative examples and combine it with the positive examples. Build a classifier, then use the classifier's score for the known positive examples as a lower bound, and label as positive any unlabeled example whose score exceeds this bound. Repeat until the number of positive examples converges. Finally, use the resulting classifier on another set of data.
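Schematically, the loop looks something like the sketch below. The function and variable names here are mine, and the real implementation (co_train.py, further down in this post) also trims outliers before taking the minimum score; this sketch just shows the shape of the iteration.

# Schematic of the iterative expand-the-positives loop described above.
# X is the feature matrix over unlabeled + positive examples; y is 1 for
# the known positives and 0 for the unlabeled rows (treated as negatives).
import numpy as np
from sklearn.svm import LinearSVC

def expand_positives(X, y, max_iters=10):
    pos_idx = np.where(y == 1)[0]
    for _ in range(max_iters):
        clf = LinearSVC()
        clf.fit(X, y)
        scores = clf.decision_function(X)
        # the lowest score among the current positives acts as the threshold
        threshold = scores[pos_idx].min()
        new_pos_idx = np.where(scores >= threshold)[0]
        if new_pos_idx.shape[0] <= pos_idx.shape[0]:
            break  # the number of positives has converged
        pos_idx = new_pos_idx
        y = np.zeros(y.shape)
        y[pos_idx] = 1
    return clf, y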

The first step is to extract the generic and brand names from the XML file. Because of the size of the file, I decided to use a SAX parser. Here's the code. It parses the generic and brand names out of the file and writes them into a pair of text files.

# Source: parse_drugbank.py
# -*- coding: utf-8 -*-
from sklearn.externals import joblib
import drug_ner_utils as dnu
import os
import xml.sax

class DrugXmlContentHandler(xml.sax.ContentHandler):
    
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.tags = []
        self.generic_names = []
        self.brand_names = []
        
    def startElement(self, name, attrs):
        self.tags.append(name)
    
    def endElement(self, name):
        self.tags.pop()
        
    def characters(self, content):
        breadcrumb = "/".join(self.tags)
        if breadcrumb == "drugbank/drug/brands/brand":
            self.brand_names.append(content)
        if breadcrumb == "drugbank/drug/name":
            self.generic_names.append(content)
    
def write_list_to_file(lst, filename):
    fout = open(os.path.join(dnu.DATA_DIR, filename), 'wb')
    for e in lst:
        fout.write("%s\n" % (e.encode("utf-8")))
    fout.close()

    
source = open(os.path.join(dnu.DATA_DIR, "drugbank.xml"), 'rb')
handler = DrugXmlContentHandler()
xml.sax.parse(source, handler)
source.close()

write_list_to_file(handler.generic_names, "generic_names.txt")
write_list_to_file(handler.brand_names, "brand_names.txt")

generic_fd = dnu.ngram_distrib(handler.generic_names, dnu.GRAM_SIZE)
brand_fd = dnu.ngram_distrib(handler.brand_names, dnu.GRAM_SIZE)

joblib.dump(generic_fd, os.path.join(dnu.DATA_DIR, "generic_fd.pkl"))
joblib.dump(brand_fd, os.path.join(dnu.DATA_DIR, "brand_fd.pkl"))

# Plot visualizations
dnu.plot_ngram_distrib(generic_fd, 30, "Generic", dnu.GRAM_SIZE)
dnu.plot_ngram_distrib(brand_fd, 30, "Brand", dnu.GRAM_SIZE)

This code produces a visualization of the distribution of the top 30 (by frequency) 3-grams, for generic names and for brand names. Notice that the two distributions are quite different, which is why we build separate classifiers for each class.






The drug_ner_utils file contains some common functions and constants that are used by multiple steps. They all have fairly descriptive names, so it should be easy to infer what they do. The str_to_ngrams function takes a string and an n-gram size, and returns the string split into a list of n-grams. The ngram_distrib function creates a frequency distribution from a list of tokens, and the plot_ngram_distrib function plots the most frequent n-grams of such a distribution. The truncate_fd function selects the most frequent n-grams from a distribution, and vectorize takes a pair of files (unlabeled and positive), where each line corresponds to a single word represented as a space-separated sequence of n-gram "words", and builds the X and y matrices in the format required by the classifier. Here is the code:

# Source: drug_ner_utils.py
# -*- coding: utf-8 -*-
from operator import itemgetter
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
import nltk
import numpy as np
import os
import string

DATA_DIR = "../../data/drug_ner"
GRAM_SIZE = 3
PUNCTS = set([c for c in string.punctuation])
NUMBERS = set([c for c in "0123456789"])

def is_punct(c):
    return c in PUNCTS
    
def is_number(c):
    return c in NUMBERS
    
def str_to_ngrams(instring, gram_size):
    ngrams = []
    for word in nltk.word_tokenize(instring.lower()):
        try:
            word = "".join(["S", word, "E"]).encode("utf-8")
            cword = [c for c in word if not(is_punct(c) or is_number(c))]
            ngrams.extend(["".join(x) for x in nltk.ngrams(cword, gram_size)])
        except UnicodeDecodeError:
            pass
    return ngrams
            
def ngram_distrib(names, gram_size):
    tokens = []
    for name in names:
        tokens.extend(str_to_ngrams(name, gram_size))
    return nltk.FreqDist(tokens)

def plot_ngram_distrib(fd, nbest, title, gram_size):
    kvs = sorted([(k, fd[k]) for k in fd], key=itemgetter(1), reverse=True)[0:nbest]
    ks = [k for k, v in kvs]
    vs = [v for k, v in kvs]
    plt.plot(np.arange(nbest), vs)
    plt.xticks(np.arange(nbest), ks, rotation="90")
    plt.title("%d-gram frequency for %s names (Top %d)" % 
              (gram_size, title, nbest))
    plt.xlabel("%d-grams" % (gram_size))
    plt.ylabel("Frequency")
    plt.show()

def truncate_fd(fd, nbest):
    kvs = sorted([(k, fd[k]) for k in fd], key=itemgetter(1), reverse=True)[0:nbest]
    return {k:v for k, v in kvs}

def vectorize(ufile, pfile, max_feats):
    text = []
    labels = []
    fno = 0
    for fname in [ufile, pfile]:
        f = open(os.path.join(DATA_DIR, fname), 'rb')
        for line in f:
            text.append(line.strip())
            labels.append(fno)
        fno = fno + 1
        f.close()
    vec = CountVectorizer(min_df=0.0, max_features=max_feats, binary=True)
    X = vec.fit_transform(text)
    y = np.array(labels)
    return X, y, vec

The next step is to convert the extracted names into corresponding n-grams; we use 3-grams for this experiment. The following code reads the files written out in the drugbank.xml parsing step and writes out corresponding files of n-grams. Each word in the input files is converted to a space-separated sequence of 3-grams, similar to words in a sentence, which is the format the downstream vectorizer expects. This results in 10,000 brand name words, 10,365 generic name words, and 199,018 unlabeled words.

# Source: ngram_convert.py
# -*- coding: utf-8 -*-
import drug_ner_utils as dnu
import os

def build_ngram_text(infile, outfile):
    fin = open(os.path.join(dnu.DATA_DIR, infile), 'rb')
    fout = open(os.path.join(dnu.DATA_DIR, outfile), 'wb')
    for line in fin:
        for word in line.strip().split():
            ngrams = dnu.str_to_ngrams(word, dnu.GRAM_SIZE)
            if len(ngrams) > 0:
                fout.write("%s\n" % " ".join(ngrams))
    fin.close()
    fout.close()


build_ngram_text("generic_names.txt", "generic_positive.txt")
build_ngram_text("brand_names.txt", "brand_positive.txt")
build_ngram_text("raw_data.txt", "unlabeled.txt")

The next step is called cotraining. It treats the unlabeled set as the negative example set, adds the positive examples to it, and trains a classifier. The hope is that the classifier will separate the unlabeled set into a part that is "close" to the positive set and a part that is not. The classifier reports, for each row, a confidence score (here the SVM decision function) for membership in the positive class. If we find the minimum score among the known positive examples (after removing outliers), we can consider any row in the unlabeled set positive if its score is higher than this minimum value. We repeat this step multiple times until the number of positive examples plateaus, feeding each iteration the labels produced by the previous one. Here is the code to do this.

# Source: co_train.py
# -*- coding: utf-8 -*-
from sklearn.externals import joblib
from sklearn.svm import LinearSVC
import drug_ner_utils as dnu
import matplotlib.pyplot as plt
import numpy as np
import os

MAX_ITERS = 10
EST_POSITIVE = 0.7
MAX_FEATURES = 3000

def conservative_min(xs):
    # remove outliers
    q25, q75 = np.percentile(xs, [25, 75])
    iqr = q75 - q25
    lb = q25 - (1.5 * iqr)
    ub = q75 + (1.5 * iqr)
    xs_con = xs[(xs >= lb) & (xs <= ub)]
    return np.min(xs_con)
    
    
for borg in ["generic", "brand"]:
    X, y, vec = dnu.vectorize("unlabeled.txt", "%s_positive.txt" % (borg), 
                              MAX_FEATURES)

    y_pos = np.where(y == 1)[0]   # indices of the known positive examples
    num_positives = [y_pos.shape[0]]

    clf = LinearSVC()
    clf.fit(X, y)

    num_iters = 0
    while (num_iters < MAX_ITERS):
        print("Iteration #%d, #-positive examples: %d" % 
              (num_iters, num_positives[-1]))
        confidence = clf.decision_function(X)
        min_pos_confidence = conservative_min(confidence[y_pos])
        y_pos = np.where(confidence >= min_pos_confidence)[0]
#        if y_pos.shape[0] <= num_positives[-1]:
#            break
        num_positives.append(y_pos.shape[0])
        y = np.zeros(y.shape)
        y[y_pos] = 1
        clf = LinearSVC()
        clf.fit(X, y)
        joblib.dump(y, os.path.join(dnu.DATA_DIR, "y_%s_%d.pkl" % 
                    (borg, num_iters)))
        num_iters += 1
    
    # visualize output
    plt.plot(np.arange(len(num_positives)), num_positives, "b-")
    plt.plot(np.arange(len(num_positives)), 
             X.shape[0] * EST_POSITIVE * np.ones(len(num_positives)), 'r--')
    plt.title("Cotraining for %s classifier (%d features)" % 
              (borg.title(), MAX_FEATURES))
    plt.xlabel("Iterations")
    plt.ylabel("#-Positives")
    plt.show()

The number of positive examples found at each iteration of this process is plotted in the charts below, the first for generic names and the second for brand names. Both classifiers plateau at around iteration #2. The code pickles the label vector at each iteration, so we can later rebuild the classifier from whichever iteration we want.






Finally, we apply these classifiers to actually detect drug names in our text. Based on the charts above, we rebuild the generic classifier from the labels saved at iteration #4 and the brand classifier from those saved at iteration #3 (matching the pickled files loaded in the code below), and pass in vectorized strings for classification. The accuracy scores for the generic and brand classifiers (on training data) are 88.4% and 89.4% respectively. However, neither classifier works well on the test data. This is because my unlabeled data is very strongly positive - by my rough estimate, about 70% of the words are either generic or brand drug names - so the classifier tends to mark everything as positive.

# Source: apply_model.py
from sklearn.externals import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
import drug_ner_utils as dnu
import numpy as np
import os

def vectorize_ngrams(ngrams, vocab):
    vec = np.zeros((1, len(vocab)))
    for ngram in ngrams:
        if ngram in vocab:
            vec[0, vocab[ngram]] = 1
    return vec
    

X, y, generic_vec = dnu.vectorize("unlabeled.txt", "generic_positive.txt", 100)
y = joblib.load(os.path.join(dnu.DATA_DIR, "y_generic_4.pkl"))
generic_clf = LinearSVC()
generic_clf.fit(X, y)
print("Score for generic classifier: %.3f" % (generic_clf.score(X, y)))

X, y, brand_vec = dnu.vectorize("unlabeled.txt", "brand_positive.txt", 100)

y = joblib.load(os.path.join(dnu.DATA_DIR, "y_brand_3.pkl"))
brand_clf = LinearSVC()
brand_clf.fit(X, y)
print("Score for brand classifier: %.3f" % (brand_clf.score(X, y)))

fraw = open(os.path.join(dnu.DATA_DIR, "raw_data.txt"), 'rb')
i = 0
for line in fraw:
    line = line.strip().lower()
    annotated = []
    for word in line.split():
        ngrams = dnu.str_to_ngrams(word, dnu.GRAM_SIZE)
        Xgen = generic_vec.transform([" ".join(ngrams)])
        Xbrand = brand_vec.transform([" ".join(ngrams)])
        is_generic = generic_clf.predict(Xgen)
        is_brand = brand_clf.predict(Xbrand)
        if is_generic == 1:
            annotated.append("<GENERIC>" + word + "</GENERIC>")
        elif is_brand == 1:
            annotated.append("<BRAND>" + word + "</BRAND>")
        else:
            annotated.append(word)
    print("Input: %s" % (line))
    print("Output: %s" % (" ".join(annotated)))
    i += 1
    if i > 10:
        break
fraw.close()

I attempted to correct for this by co-training with a very high number of features (3,000), so the classifier would learn to be more conservative in its decisions, and then retraining the final classifiers with a much smaller number of features (100 and 50 for the generic and brand classifiers respectively). I also tried scaling the data before training the classifiers; that did increase classifier accuracy (on training data) to 98% and 95%, but the resulting classifiers still marked almost all words as drugs, so it was not very effective.
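For reference, the scaling I mention was along these lines. This is a minimal sketch, not the exact code: X and y are assumed to come from dnu.vectorize as in apply_model.py, and StandardScaler with with_mean=False is just one scaler that works on sparse input without densifying it - I am not claiming it is the only (or best) choice here.

# Sketch: scale the sparse n-gram count matrix before training the SVM.
# with_mean=False avoids mean-centering, which would densify the sparse
# matrix produced by CountVectorizer.
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X)   # X, y as returned by dnu.vectorize(...)
clf = LinearSVC()
clf.fit(X_scaled, y)
print("Score after scaling: %.3f" % (clf.score(X_scaled, y)))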

So I guess it's back to the drawing board for this one. However, it was interesting to learn of a method that allows us to build classifiers with no negative examples.

The code for this work is spread out across several files, each corresponding to a specific step in the process; the steps communicate via flat files and pickled objects dropped by the previous step. The code can be found in my GitHub project, the drugbank.xml file can be downloaded, and you will have to supply your own unlabeled data.