Topic/Document Labelling and "Concepts" Extraction with Word Embeddings

In this notebook we demonstrate how to replicate Unsilo's functionality for extracting concepts and document similarities using word embeddings (Word2Vec/Doc2Vec).

Load some utility functions:

In [1]:
from utilities import *
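
The utilities module itself is not shown in this notebook. Judging by the names used below, it looks roughly like the following sketch (the helper implementations are hypothetical; only the third-party imports are certain from the code that follows):

#Hypothetical sketch of utilities.py (the real module is not shown).
#It re-exports the third-party imports used throughout the notebook and
#defines helpers such as get_numeric_documents, word_distribution_by_topic,
#term_relevance_by_topic, get_doc_topics, extract_chunks_new and remove_punk.
import os
import string
import logging
import warnings

import numpy as np
import pandas as pd
from gensim import corpora, models
from gensim.models.doc2vec import Doc2Vec
from annoy import AnnoyIndex

def remove_punk(text):
    #Strip punctuation before looking words up in the vocabulary (assumed behaviour)
    return text.translate(None, string.punctuation)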

Set logging and warning levels:

In [2]:
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
logging.root.level = logging.INFO  # ipython sometimes messes up the logging setup; restore
warnings.filterwarnings('ignore')

Load the AR (Annual Reviews) articles:

In [3]:
#Load the journals datafile
train = pd.read_csv(os.path.join('./', 'data', 'journals.txt'), sep='&&&&', names=["doc_id", "doc_text"], engine='python')
In [4]:
print "Read %d docs" % (train["doc_text"].size)
Read 35014 docs

Load the trained Doc2Vec model:

In [5]:
model=Doc2Vec.load('/local/process/journals.d2v')
#Fix the new naming convention between gensim versions
model.docvecs.index2doctag=model.docvecs.offset2doctag

We also have a trained LDA model with 20 topics, so let's load all the preprocessed files:

In [6]:
#Load the lda model
lda = models.LdaModel.load('./lda_AR_20.model')
In [7]:
#Load the dictionary
dictionary= corpora.Dictionary.load('./dictionary.dict')
In [8]:
#Load the corpus
corpus = corpora.MmCorpus('corpus.mm')

There are actually 35013 documents in the corpus (an empty document, produced by XML parsing, has been removed). The lexicon has been restricted to 73739 terms by cutting off all rare words (those that do not appear more than 40 times).
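
The preprocessing itself is not part of this notebook; a minimal sketch of how these three artifacts are typically produced with gensim (tokenized_docs and the exact filtering thresholds are assumptions):

#Minimal sketch (not the actual preprocessing script); tokenized_docs is
#assumed to be a list of token lists, one per article
dictionary = corpora.Dictionary(tokenized_docs)
dictionary.filter_extremes(no_below=40, no_above=1.0, keep_n=None)   #drop rare terms (threshold assumed)
dictionary.save('./dictionary.dict')
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
corpora.MmCorpus.serialize('corpus.mm', corpus)   #Matrix Market format, as loaded above
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=20)
lda.save('./lda_AR_20.model')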

Display and store the top-10 most frequent words for each topic as given by LDA:

In [9]:
## First select the top-50 words for each of the 20 LDA topics
top_words = [[word for _,word in lda.show_topic(topicno, topn=50)] for topicno in range(lda.num_topics)]
In [10]:
max_words=10
all_topic_freq_word_lists=[]
for topicno, words in enumerate(top_words):
    print("Topic %i : %s" % (topicno+1, ' '.join(words[:max_words])))
    topic_word_list=[]
    for term in words[:max_words]:
        topic_word_list.append(term)
    all_topic_freq_word_lists.append(topic_word_list)
Topic 1 : structure protein proteins binding figure two molecular structures structural membrane
Topic 2 : virus infection infected host viruses pathogen transmission cells disease cell
Topic 3 : = model 1 models one data energy figure time two
Topic 4 : one would work new research & first time also years
Topic 5 : figure energy 1 surface high temperature gas phase flow materials
Topic 6 : cells growth cancer human hormone bone cell estrogen mice tumor
Topic 7 : transport cuckoo oxygen membrane atp ion iron k+ na+ mitochondrial
Topic 8 : cells cell membrane plasma receptor binding surface endothelial 1 human
Topic 9 : protein receptor proteins binding kinase ca2+ activation 1 cell membrane
Topic 10 : social use research may also cost new costs economic states
Topic 11 : acid enzyme activity acids amino synthesis metabolism glucose compounds 1
Topic 12 : blood disease per pressure may patients renal found normal dogs
Topic 13 : secretion insulin cells resistance food intestinal induced increased effects response
Topic 14 : water soil carbon may & forest per temperature organic forests
Topic 15 : cell cells expression gene drosophila genes development figure actin mouse
Topic 16 : studies genetic may individuals study age variation selection traits life
Topic 17 : muscle brain neurons cells activity stimulation response channels may rat
Topic 18 : dna gene genes rna protein sequence sequences genome two 1
Topic 19 : species may plant plants host eggs size population populations evolution
Topic 20 : et al & 2002 2003 2001 2005 2004 2006 2000

Find the most relevant words for each topic, as defined by Sievert and Shirley (2014):

In [11]:
#First get a bag-of-words (BoW) numeric representation of the documents
docs=get_numeric_documents(corpus)
In [12]:
#Now get distributions over words for each topic
word_dists = word_distribution_by_topic(lda,dictionary)
In [13]:
#Compute word relevance per topic
relevance=term_relevance_by_topic(word_dists, docs, dictionary)
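
term_relevance_by_topic comes from utilities; for reference, a sketch of the Sievert and Shirley relevance score it presumably implements, which combines a term's probability within a topic with its lift over the corpus-wide term probability (the data layout is assumed; lam=0.6 follows the paper's recommendation):

#Sketch of the relevance score (data layout assumed):
#relevance(w,t) = lam*log p(w|t) + (1-lam)*log( p(w|t)/p(w) )
def relevance_sketch(topic_word_probs, corpus_word_probs, lam=0.6):
    scores = {}
    for w, p_wt in topic_word_probs.items():
        p_w = corpus_word_probs[w]
        scores[w] = lam*np.log(p_wt) + (1.0-lam)*np.log(p_wt/p_w)
    #Return terms sorted by decreasing relevance
    return sorted(scores.items(), key=lambda x: -x[1])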

Display and store the top-10 most relevant words for each topic:

In [14]:
all_topic_rel_word_lists=[]
for i, topic in enumerate(relevance):
        print "topic", i+1, ":",
        topic_word_list=[]
        for term, _ in topic[:max_words]:
                topic_word_list.append(term)
                print term,
        print
        all_topic_rel_word_lists.append(topic_word_list)
topic 1 : sieving gel structure gels electrophoretic protein polymer structures residues structural
topic 2 : schistosomiasis mansoni infection unparasitized infected virus schistosomes ofinfection schistosome pathogen
topic 3 : dinosaur models = model uta spedding energy yt data eqs
topic 4 : geist loucks volta chihuahua camels zoogeographic weapons opportunism plunges gregorys
topic 5 : energy materials gas figure temperature surface thermal electric phase reservoir
topic 6 : estrogen offemale progesterone bone hormone fetal cancer fsh estradiol ovulation
topic 7 : cuckoo transport na+ k+ atpase oxygen cl− atp h+ schnell
topic 8 : endothelial plasma membrane cell cells adhesion collagen mucin cholesterol membranes
topic 9 : ca2+ kinase receptor protein phosphorylation camp signaling proteins activation binding
topic 10 : agro hairston impoundment social costs legal cost management economic resources
topic 11 : acid enzyme blackbird acids vmax electrophoretically dehydrogenase glucose compounds metabolism
topic 12 : blood giraffe schistosomiasis masticatory ducks worms dogs nestling renal disease
topic 13 : secretion insulin gastric intestinal airway lung alveolar intestine surfactant fat
topic 14 : impoundments detritus humus soil forests montane forest mull dams grubb
topic 15 : drosophila neoteny circadian actin melanogaster expression cell clock cells embryonic
topic 16 : traits monogamous dsm disorder depression disorders genetic cognitive schizophrenia anxiety
topic 17 : lrf muscle neurons brain nerve stimulation channels cortex fibers visual
topic 18 : dna rna gene glabrata genes bovids sequences sequence ofnewly genome
topic 19 : nectaries extrafloral pheasants dinosaurs species ungulates eggs nests savanna birds
topic 20 : et al & facies 2002 2006 2005 2003 2004 2001

Get the topic distributions for each document:

In [15]:
#Get the topic distributions for each document
topic_distributions=[]

for i in range(len(corpus)):
    doc_topic_distributions=get_doc_topics(lda,corpus[i])
    topic_distributions.append(doc_topic_distributions)
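
get_doc_topics comes from utilities; a plausible sketch of what it does, using LdaModel.inference to obtain a dense (unpruned) topic vector for a single bag-of-words document:

#Plausible sketch of get_doc_topics (the actual helper lives in utilities)
def get_doc_topics_sketch(lda_model, bow):
    gamma, _ = lda_model.inference([bow])   #variational posterior over topics
    return gamma[0] / gamma[0].sum()        #normalize to a probability distribution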

Pick a single document and find its predominant topic:

In [16]:
#Let's use the article shown in the first page of the Unsilo demo
doi='10.1146/annurev.pathol.3.121806.151422'

#Find the index of this document in the AR corpus
idx = train[train['doc_id']==doi].index.tolist()
j_index=idx[0]

#Save the document's data for this doi in a suitable dataframe
topic_sentences = pd.DataFrame(data=None, columns=train.columns)
k=0
topic_sentences.loc[k, 'doc_id']=train["doc_id"][j_index]
topic_sentences.loc[k, 'doc_text']=train["doc_text"][j_index]

#Find the document's predominant topic (i.e. the one in which it has the maximum probability/membership)
max_value=0.0
max_index=0
for i in range(lda.num_topics):
    if (topic_distributions[j_index][i]>=max_value):
        max_value=topic_distributions[j_index][i]
        max_index=i
    
print "Predominant topic for article", '"'+doi+'"', "is topic:", max_index+1
Predominant topic for article "10.1146/annurev.pathol.3.121806.151422" is topic: 6

Let's remember what topic 6 is about by showing its most frequent and relevant words:

In [17]:
print("Most frequent words for Topic %i : %s" % (max_index+1, ' '.join(top_words[max_index][:max_words])))
print
print "Most relevant words for topic", max_index+1, ":", 
for term, _ in relevance[max_index][:max_words]:
    print term,
        
Most frequent words for Topic 6 : cells growth cancer human hormone bone cell estrogen mice tumor

Most relevant words for topic 6 : estrogen offemale progesterone bone hormone fetal cancer fsh estradiol ovulation

Judging by the above, the topic should be about cancer, especially female-related cancers.

We will chunk-parse our document in order to extract chunks that contain at least one of the top-10 words (frequent or relevant) of topic 6, as candidate labels/concepts.
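
extract_chunks_new and its numbered grammars live in utilities; the general technique is POS-tagging plus regexp chunking, keeping only chunks that mention a topic word. A minimal sketch with a single noun-phrase grammar (the grammar string and the filtering rule are assumptions, not the actual grammars 0-13):

import nltk

#Minimal sketch of regexp chunk extraction (the real grammars are in utilities)
def extract_chunks_sketch(text, topic_words, grammar=r'NP: {<JJ>*<NN.*>+}'):
    parser = nltk.RegexpParser(grammar)
    chunks = []
    for sent in nltk.sent_tokenize(text):
        tree = parser.parse(nltk.pos_tag(nltk.word_tokenize(sent)))
        for subtree in tree.subtrees(lambda t: t.label() == 'NP'):
            phrase = ' '.join(w for w, tag in subtree.leaves()).lower()
            #Keep only chunks that contain at least one topic word
            if any(tw in phrase.split() for tw in topic_words):
                chunks.append(phrase)
    return chunks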

Let's first search with the top-10 most frequent words in topic 6:

In [18]:
#Indices of the NLP chunking grammars that have been tested to work well
proper_grammars=[0,1,2,3,6,7,8,9,13]

doc_name='"'+train["doc_id"][j_index]+'"'
print "Concepts for document", doc_name, "using frequent words from its predominant topic:"
print

for g in proper_grammars:

    f_chunks_list=extract_chunks_new(topic_sentences,all_topic_freq_word_lists[max_index],g)

    #Let's find the K nearest chunks (neighbours) of this document in the Doc2Vec space
    
    #Dimensionality of feature space
    f=model.syn0.shape[1]

    #Use the fast approximate KNN Annoy method
    t = AnnoyIndex(f)

    #Add the document itself as the first (0-th) element of the Annoy index
    t.add_item(0,model.docvecs[train["doc_id"][j_index]])

    #For each extracted chunk, average the word vectors of its in-vocabulary words and add the result to the Annoy index
    for j in range(len(f_chunks_list)):
        v = np.zeros((model.syn0.shape[1],),dtype="float32")
        c_list=remove_punk(f_chunks_list[j].lower()).split()
        n_in_vocab=0
        for k in range(len(c_list)):
            if c_list[k] in model.vocab:
                v = np.add(v, model[c_list[k]])
                n_in_vocab+=1
        if n_in_vocab>0:
            v = np.divide(v,n_in_vocab)   #average only over words found in the vocabulary
        t.add_item(j+1, v)

    #Number of trees to build for the Annoy index
    t.build(500)

    #Find the K=10 nearest neighbours of item 0 (the document itself)
    nns=t.get_nns_by_item(0, 10)


    print "For grammar", g , ":"

    print
    for j in nns[1:]:
        print ("%s,") %(f_chunks_list[j-1]),

    print
    print
Concepts for document "10.1146/annurev.pathol.3.121806.151422" using frequent words from its predominant topic:

For grammar 0 :

brca1 brca2, breast cancer susceptibility gene brca2, human breast cancer susceptibility gene brca2, breast cancer susceptibility protein brca2, brca1 p53, brca1 brca2 mutation carriers, brca2 p53 nullizygous embryos, brca2-associated breast tumours, human brca2,

For grammar 1 :

brca1 brca2, brca2 p53, brca1 p53, brca2-associated breast, cellular functions brca gene products instability, brca1-mutant breast, colorectal tumorigenesis, breast cancer, chromosomal instability model pathogenesis brca-associated cancers,

For grammar 2 :

human breast tumorigenesis, breast cancer, breast cancer susceptibility gene, brca2-dss1-ssdna structure, hereditary breast, breast cancer predisposition, rad51-brca2 complex, brca2-rad51 complex, breast cancer susceptibility protein,

For grammar 3 :

human breast tumorigenesis, breast cancer, breast cancer susceptibility gene, breast cancer predisposition, brca2-rad51 complex, breast cancer susceptibility protein, ovarian cancer, breast cancer susceptibility, breast cancer risk overall,

For grammar 6 :

brca1 brca2, breast cancer susceptibility gene brca2, human breast cancer susceptibility gene brca2, breast cancer susceptibility protein brca2, breast cancer cells disrupts brca2-rad51 complex, brca2 cellular, ovarian cancer brca, brca1 p53, brca1 brca2 mutation carriers,

For grammar 7 :

brca1 brca2, brca2 p53, breast cancer susceptibility gene brca2, breast cancer susceptibility protein brca2, brca1 p53, brca1 brca2 mutation carriers, rab163 mouse brca2, tumor suppressor gene brca1, brca2 protein,

For grammar 8 :

primary breast cancer, hereditary breast cancer, brca2-associated breast, brca1-linked breast cancer, familial breast cancer predisposition, brca1-deficient breast cancer, human breast tumorigenesis, brca1-mutant breast, human breast cancer susceptibility gene,

For grammar 9 :

brca1 brca2, brca2 p53, breast cancer susceptibility gene brca2, human breast cancer susceptibility gene brca2, breast cancer susceptibility protein brca2, brca1 p53, brca1 brca2 mutation carriers, brca2-associated breast tumours, human brca2,

For grammar 13 :

brca1 brca2-associated breast tumours, brca1 brca2, hereditary breast ovarian cancer brca, breast cancer susceptibility genes brca1 brca2 arises, brca2 p53, brca1 brca2 mutation display distinct mammary gland ovarian phenotypes, breast cancer susceptibility gene brca2, human breast cancer susceptibility gene brca2, breast cancer susceptibility protein brca2,

Now let's search with the top-10 most relevant words in topic 6:

In [19]:
#Indices of the NLP chunking grammars that have been tested to work well
proper_grammars=[0,1,2,3,6,7,8,9,13]

doc_name='"'+train["doc_id"][j_index]+'"'
print "Concepts for document", doc_name, "using relevant words from its predominant topic:"
print


for g in proper_grammars:
    
    r_chunks_list=extract_chunks_new(topic_sentences,all_topic_rel_word_lists[max_index],g)

    #For these newly extracted chunks, let's find the K nearest neighbours of this document in the Doc2Vec space
    
    #Dimensionality of feature space
    f=model.syn0.shape[1]

    #Use the fast approximate KNN Annoy method
    t = AnnoyIndex(f)

    #Add the document itself as the first (0-th) element of the Annoy index
    t.add_item(0,model.docvecs[train["doc_id"][j_index]])

    #For each extracted chunk, average the word vectors of its in-vocabulary words and add the result to the Annoy index
    for j in range(len(r_chunks_list)):
        v = np.zeros((model.syn0.shape[1],),dtype="float32")
        c_list=remove_punk(r_chunks_list[j].lower()).split()
        n_in_vocab=0
        for k in range(len(c_list)):
            if c_list[k] in model.vocab:
                v = np.add(v, model[c_list[k]])
                n_in_vocab+=1
        if n_in_vocab>0:
            v = np.divide(v,n_in_vocab)   #average only over words found in the vocabulary
        t.add_item(j+1, v)

    #Number of trees to build for the Annoy index
    t.build(500)

    #Find the K=10 nearest neighbours of item 0 (the document itself)
    nns=t.get_nns_by_item(0, 10)

    print "For grammar", g , ":"

    print
    for j in nns[1:]:
        print ("%s,") %(r_chunks_list[j-1]),

    print
    print
Concepts for document "10.1146/annurev.pathol.3.121806.151422" using relevant words from its predominant topic:

For grammar 0 :

brca1 brca2, breast cancer susceptibility gene brca2, human breast cancer susceptibility gene brca2, breast cancer susceptibility protein brca2, brca1 p53, brca1 brca2 mutation carriers, brca2 p53 nullizygous embryos, brca2-associated breast tumours, primary breast cancer,

For grammar 1 :

brca1 brca2, brca2 p53, brca1 p53, brca2-associated breast, cellular functions brca gene products instability, colorectal tumorigenesis, breast cancer, chromosomal instability model pathogenesis brca-associated cancers, breast tumor,

For grammar 2 :

human breast tumorigenesis, breast cancer, breast cancer susceptibility gene, brca2-dss1-ssdna structure, hereditary breast, breast cancer predisposition, rad51-brca2 complex, brca2-rad51 complex, breast cancer susceptibility protein,

For grammar 3 :

human breast tumorigenesis, breast cancer, breast cancer susceptibility gene, breast cancer predisposition, brca2-rad51 complex, breast cancer susceptibility protein, ovarian cancer, breast cancer susceptibility, breast cancer risk overall,

For grammar 6 :

brca1 brca2, breast cancer susceptibility gene brca2, human breast cancer susceptibility gene brca2, breast cancer susceptibility protein brca2, breast cancer cells disrupts brca2-rad51 complex, brca2 cellular, brca1 p53, brca1 brca2 mutation carriers, brca2 p53 nullizygous embryos,

For grammar 7 :

brca1 brca2, brca2 p53, breast cancer susceptibility gene brca2, breast cancer susceptibility protein brca2, brca1 p53, brca1 brca2 mutation carriers, rab163 mouse brca2, tumor suppressor gene brca1, brca2 protein,

For grammar 8 :

primary breast cancer, hereditary breast cancer, brca2-associated breast, brca1-linked breast cancer, familial breast cancer predisposition, brca1-deficient breast cancer, human breast tumorigenesis, human breast cancer susceptibility gene, colorectal tumorigenesis,

For grammar 9 :

brca1 brca2, brca2 p53, breast cancer susceptibility gene brca2, human breast cancer susceptibility gene brca2, breast cancer susceptibility protein brca2, brca1 p53, brca1 brca2 mutation carriers, brca2-associated breast tumours, primary breast cancer,

For grammar 13 :

brca1 brca2-associated breast tumours, brca1 brca2, breast cancer susceptibility genes brca1 brca2 arises, brca2 p53, brca1 brca2 mutation display distinct mammary gland ovarian phenotypes, breast cancer susceptibility gene brca2, human breast cancer susceptibility gene brca2, breast cancer susceptibility protein brca2, brca1 brca2-associated breast gynecologic cancer,

Thus, we have extracted concepts/labels for the document.

We can also find similarities between topics and documents, as well as similarities between concepts/labels and topics.

First we transpose the document-topic matrix, so that each row holds one topic's membership weights over all the documents:

In [20]:
document_distributions=np.array(topic_distributions).transpose()

Let's compute the weighted average vector of each topic in the Doc2Vec space, built from the documents that have membership (probability) >= 0.2 in that topic and weighted by that membership:

In [21]:
threshold=0.2
topicVecs=[]

for i in range(lda.num_topics):

    featureVec = np.zeros((model.syn0.shape[1],),dtype="float32")
    k=0
    sum_m=0
    for j in range(document_distributions.shape[1]):
        if (document_distributions[i][j]>=threshold):
            #featureVec = np.add(featureVec,model.docvecs[train["doc_id"][j]])
            featureVec = np.add(featureVec,np.multiply(model.docvecs[train["doc_id"][j]],document_distributions[i][j]))
            k=k+1
            sum_m=sum_m+document_distributions[i][j]

    featureVec = np.divide(featureVec,sum_m)

    print ("Topic %i : Number of documents in this topic: %i") %(i+1,k)
    topicVecs.append(featureVec)
Topic 1 : Number of documents in this topic: 2569
Topic 2 : Number of documents in this topic: 1446
Topic 3 : Number of documents in this topic: 5148
Topic 4 : Number of documents in this topic: 5983
Topic 5 : Number of documents in this topic: 5586
Topic 6 : Number of documents in this topic: 1640
Topic 7 : Number of documents in this topic: 1334
Topic 8 : Number of documents in this topic: 1239
Topic 9 : Number of documents in this topic: 2228
Topic 10 : Number of documents in this topic: 5263
Topic 11 : Number of documents in this topic: 4167
Topic 12 : Number of documents in this topic: 3321
Topic 13 : Number of documents in this topic: 1521
Topic 14 : Number of documents in this topic: 1835
Topic 15 : Number of documents in this topic: 1722
Topic 16 : Number of documents in this topic: 2260
Topic 17 : Number of documents in this topic: 2634
Topic 18 : Number of documents in this topic: 3205
Topic 19 : Number of documents in this topic: 3850
Topic 20 : Number of documents in this topic: 965

Find the similarity of our current document with the topic vectors in the Doc2Vec space:

In [22]:
#Dimensionality of feature space
f=model.syn0.shape[1]

#Use the fast approximate KNN Annoy method
t = AnnoyIndex(f)

#Add the document itself as the first (0-th) element of the Annoy index
t.add_item(0,model.docvecs[train["doc_id"][j_index]])

for j in range(lda.num_topics):
    t.add_item(j+1,topicVecs[j])

#Number of trees to build for the Annoy index
t.build(10)

#Find the K=10 nearest neighbours of item 0 (the document itself)
nns=t.get_nns_by_item(0, 10)


print "Topics similar to document", doc_name, ":"

for j in nns[1:]:
    print ("Topic %d,") %(j),
print
Topics similar to document "10.1146/annurev.pathol.3.121806.151422" :
Topic 6, Topic 15, Topic 9, Topic 18, Topic 13, Topic 16, Topic 2, Topic 8, Topic 17,
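
Annoy is approximate, and with only 20 topic vectors it is cheap to cross-check the ranking with exact cosine similarities; a quick sanity check, not part of the original pipeline:

#Sanity check: exact cosine similarity between the document and each topic vector
docvec = model.docvecs[train["doc_id"][j_index]]
sims = [np.dot(docvec, tv) / (np.linalg.norm(docvec) * np.linalg.norm(tv))
        for tv in topicVecs]
#Print the top-9 topics by decreasing cosine similarity (to match the Annoy output)
for rank, t in enumerate(np.argsort(sims)[::-1][:9]):
    print "Rank %d: Topic %d (cosine %.3f)" % (rank + 1, t + 1, sims[t])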

Find the similarity of any extracted concept with the topic vectors in the Doc2Vec space:

In [23]:
#First extract a concept (chunk), e.g. "brca1 brca2-associated breast gynecologic cancer" 
print r_chunks_list[525]
brca1 brca2-associated breast gynecologic cancer
In [24]:
#Dimensionality of feature space
f=model.syn0.shape[1]

#Use the fast approximate KNN Annoy method
t = AnnoyIndex(f)


#For this chunk, average the word vectors of its in-vocabulary words

v = np.zeros((model.syn0.shape[1],),dtype="float32")
c_list=remove_punk(r_chunks_list[525].lower()).split()
n_in_vocab=0
for k in range(len(c_list)):
    if c_list[k] in model.vocab:
        v = np.add(v, model[c_list[k]])
        n_in_vocab+=1
if n_in_vocab>0:
    v = np.divide(v,n_in_vocab)   #average only over words found in the vocabulary

#Add the chunk vector itself as the first (0-th) element of the Annoy index
t.add_item(0,v)

for j in range(lda.num_topics):
    t.add_item(j+1,topicVecs[j])

#Number of trees to build for the Annoy index
t.build(10)

#Find the K=10 nearest neighbours of item 0 (the concept vector)
nns=t.get_nns_by_item(0, 10)

print "Topics similar to concept " + '"'+r_chunks_list[525]+'"' + ":"
  
for j in nns[1:]:
    print ("Topic %d,") %(j),
print
Topics similar to concept "brca1 brca2-associated breast gynecologic cancer":
Topic 6, Topic 13, Topic 16, Topic 9, Topic 15, Topic 2, Topic 8, Topic 12, Topic 18,

Find the most similar documents to the current document in the Doc2Vec space (the query document itself comes back first, with similarity close to 1):

In [25]:
docvec = model.docvecs[train["doc_id"][j_index]]
print model.docvecs.most_similar([docvec])
[('10.1146/annurev.pathol.3.121806.151422', 0.9999998807907104), ('10.1146/annurev-genet-102108-134222', 0.6315860152244568), ('10.1146/annurev.genom.7.080505.115648', 0.6228210926055908), ('10.1146/annurev-genet-110410-132435', 0.6022162437438965), ('10.1146/annurev.genom.2.1.41', 0.600504457950592), ('10.1146/annurev-genet-051710-150955', 0.582023561000824), ('10.1146/annurev-biophys-051013-022737', 0.574215292930603), ('10.1146/annurev.genet.35.102401.090432', 0.5680429935455322), ('10.1146/annurev-med-081313-121208', 0.5654281973838806), ('10.1146/annurev.genet.36.060402.113540', 0.563448429107666)]

Find the most similar documents to the current document, given a concept, in the Doc2Vec space:

In [26]:
docvec = model.docvecs[train["doc_id"][j_index]]

#For this concept ("brca1 brca2-associated breast gynecologic cancer") we average its in-vocabulary word vectors
v = np.zeros((model.syn0.shape[1],),dtype="float32")
c_list=remove_punk(r_chunks_list[525].lower()).split()
n_in_vocab=0
for k in range(len(c_list)):
    if c_list[k] in model.vocab:
        v = np.add(v, model[c_list[k]])
        n_in_vocab+=1
if n_in_vocab>0:
    v = np.divide(v,n_in_vocab)   #average only over words found in the vocabulary

composite_doc_vec=np.add(docvec,v)

composite_doc_vec= np.divide(composite_doc_vec,2)

print model.docvecs.most_similar([composite_doc_vec])
[('10.1146/annurev.pathol.3.121806.151422', 0.9638024568557739), ('10.1146/annurev.genom.7.080505.115648', 0.6152673959732056), ('10.1146/annurev-genet-102108-134222', 0.604752779006958), ('10.1146/annurev.genom.2.1.41', 0.5785870552062988), ('10.1146/annurev.genet.35.102401.090432', 0.5734236240386963), ('10.1146/annurev.genom.4.070802.110341', 0.5656291246414185), ('10.1146/annurev.med.49.1.425', 0.5623054504394531), ('10.1146/annurev.genet.32.1.95', 0.557499349117279), ('10.1146/annurev-med-050913-022545', 0.5562744140625), ('10.1146/annurev-genet-110410-132435', 0.55494225025177)]