# [Corpora-List] Local Similarity in LDA

Yashar Najafloo yasharnajafloo at yahoo.com
Thu May 19 23:24:40 CEST 2016

Hi Alex,

Thanks for your comment. Yes, exactly. The equation I gave is the similarity between two words across the whole corpus (global similarity). My question is how to calculate the similarity of two words in a specific document (local similarity), given that we already know how strongly each document is related to each topic. In other words, I would like an equation that also reflects the topic proportions of the individual document.

My reasoning is this: if we know how probable it is for a word to appear in each topic, and how probable it is for each topic to be the subject of a document, then we can calculate local similarity by multiplying each row of the topic-word table (the table used to work out the global similarity) by the corresponding topic proportion of the given document, then working out the same math again and calling the result local similarity. Does this make sense, and is it mathematically correct?

Regards,
Yashar
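One way to read the proposal above, written out as a rough sketch rather than a definitive answer: apply Bayes' rule within the document, so that P(z|w2,d) = P(w2|z)P(z|d) / SUM(z') [P(w2|z')P(z'|d)], and then P(w1|w2,d) = SUM(z) [P(w1|z)P(z|w2,d)]. This assumes, as in the LDA generative model, that words are independent given their topic. The names `phi` (topic-word table, one row per topic), `theta_d` (topic proportions of one document) and `word_similarity` are hypothetical, and the numbers are invented toy values, not results from any corpus.

```python
import numpy as np

def word_similarity(phi, topic_weights, w1, w2):
    """P(w1 | w2) under the given topic weights.

    phi           : array (n_topics, n_words), phi[z, w] = P(w | z)
    topic_weights : array (n_topics,), either P(z | d) for one document
                    (local similarity) or a corpus-wide P(z) (global)
    w1, w2        : integer word indices
    """
    # Bayes' rule: P(z | w2) is proportional to P(w2 | z) * weight of z
    posterior = phi[:, w2] * topic_weights
    posterior = posterior / posterior.sum()
    # Sum over topics of P(w1 | z) * P(z | w2)
    return float(phi[:, w1] @ posterior)

# Toy topic-word table: 2 topics, 3 words (each row sums to 1)
phi = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.3, 0.6]])

theta_d = np.array([0.9, 0.1])       # assumed topic proportions of one document
corpus_prior = np.array([0.5, 0.5])  # assumed corpus-wide topic proportions

local_sim = word_similarity(phi, theta_d, w1=0, w2=1)
global_sim = word_similarity(phi, corpus_prior, w1=0, w2=1)
print(local_sim, global_sim)
```

With the same function, the only difference between the global and the local version is which topic weights are plugged in, which is essentially the "multiply the rows by the document's topic proportions" step described above, followed by renormalising.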

On Thursday, 19 May 2016 7:34 PM, Alexander Yeh <asy at mitre.org> wrote:

Yashar Najafloo wrote:
> Hi there,
>
> I have a question with regard to similarity between two words in LDA
> (Latent Dirichlet Allocation) and was wondering if anyone can kindly
> help me out.
> I'll try to keep it short.
>
> I have a corpus and analysed it using LDA with variational inference. I
> now know how much each document is about the different topics and how
> much each topic is about the different words in my word list. I know the
> similarity between two words can be calculated from the amount of topic
> mass the two share, which is the sum over topics (say 10 topics) of the
> conditional probability of word one given topic z multiplied by the
> conditional probability of topic z given word two:
> P(w1|w2)=SUM (z=1 to 10) [P(w1|z)P(z|w2)]

The above looks to be the similarity of 2 words according to the LDA topic models. It may be interesting to compare this with a P(w1|w2) calculated directly from the documents themselves: (# of documents with both w1 and w2)/(# of documents with w2)
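As a rough illustration of the document-count estimate just described, here is a minimal sketch. The function name `cooccurrence_conditional` and the toy documents are made up for the example; each document is assumed to be already tokenised into a list of words.

```python
def cooccurrence_conditional(docs, w1, w2):
    """(# of documents with both w1 and w2) / (# of documents with w2).

    docs is a list of tokenised documents (lists of word strings).
    Returns None when no document contains w2, since the ratio is undefined.
    """
    docs_with_w2 = [set(doc) for doc in docs if w2 in doc]
    if not docs_with_w2:
        return None
    n_both = sum(1 for doc in docs_with_w2 if w1 in doc)
    return n_both / len(docs_with_w2)

# Invented toy corpus, for illustration only
docs = [
    ["topic", "model", "word"],
    ["word", "corpus"],
    ["topic", "word"],
    ["corpus", "model"],
]

p_topic_given_word = cooccurrence_conditional(docs, "topic", "word")
print(p_topic_given_word)
```

This counts document-level co-occurrence only, so it ignores how often the words appear within a document; comparing it with the topic-based value shows how much the topic model smooths the raw counts.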

-Alex

>
> The question is how to calculate the similarity of two words in a
> particular document (we know how much the documents are about topics).
> I was thinking of taking the topic proportions of the document as
> weights, multiplying the topics by their weights and working out the
> above-mentioned math. Is what I am trying to achieve mathematically correct?
>
> Regards,
> Yashar