[Corpora-List] Local Similarity in LDA

Yashar Najafloo yasharnajafloo at yahoo.com
Thu May 19 23:24:40 CEST 2016

Hi Alex,

Thanks for your comment. Yes, exactly. The equation I gave is the similarity between two words across the whole corpus (global similarity). My question is how to calculate the similarity of two words in a specific document (local similarity), given that we already know how strongly each document is related to each topic. In other words, I would like an equation that also reflects the topic proportions of individual documents.

My reasoning is this: if we know how probable it is for a word to appear in each topic, and how probable it is for each topic to be the subject of a document, then we can calculate local similarity by multiplying the rows of the topic-word table (the table used to work out the global similarity) by the topic proportions of the given document, working out the same math again, and calling the result local similarity. Does this make sense, and is it mathematically correct?

Regards,
Yashar
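One way to read the proposal above, offered as a sketch and not as a confirmed answer from the thread: treat the document's topic proportions P(z|d) as the topic prior when computing P(z|w2) by Bayes' rule, so that P(w1|w2,d) = SUM over z of P(w1|z) P(z|w2,d), with P(z|w2,d) proportional to P(w2|z) P(z|d). The function name `word_given_word` and the toy tables `phi` (topic-word) and `theta` (document-topic) are illustrative assumptions, not values from any real model.

```python
def word_given_word(phi, topic_weights, w1, w2):
    """Return sum_z P(w1|z) * P(z|w2), where P(z|w2) comes from Bayes' rule.

    phi[z][w]        -- P(w|z), one row per topic (assumed given by LDA)
    topic_weights[z] -- P(z) for a global score, or P(z|d) for a local one
    """
    # Unnormalised posterior over topics: P(w2|z) * weight of topic z.
    joint = [phi[z][w2] * topic_weights[z] for z in range(len(phi))]
    total = sum(joint)
    if total == 0:
        raise ValueError("w2 has zero probability under these topic weights")
    # Normalise to P(z|w2) and mix the topic-word rows with it.
    return sum(phi[z][w1] * joint[z] / total for z in range(len(phi)))


# Toy example (hypothetical numbers): 2 topics, 3 words, 2 documents.
phi = [[0.7, 0.2, 0.1],   # P(w|z=0)
       [0.1, 0.3, 0.6]]   # P(w|z=1)
theta = [[0.9, 0.1],      # P(z|d=0)
         [0.2, 0.8]]      # P(z|d=1)

local_d0 = word_given_word(phi, theta[0], w1=0, w2=1)
local_d1 = word_given_word(phi, theta[1], w1=0, w2=1)
print(local_d0, local_d1)
```

Under this reading the same pair of words scores differently in each document because the re-weighted topic posterior differs, and for a fixed w2 and document the scores over all w1 still sum to 1. Whether this matches what is wanted from a "local similarity" depends on the application.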

On Thursday, 19 May 2016 7:34 PM, Alexander Yeh <asy at mitre.org> wrote:

Yashar Najafloo wrote:
> Hi there,
> I have a question with regards to similarity between two words in LDA
> (Latent Dirichlet Allocation) and was wondering if anyone can kindly
> help me out.
> I'll try to keep it short.
> I have a corpus and analysed it using LDA with variational inference. I
> now know how much each document is about each topic and how much each
> topic is about each word in my word list. I know the similarity
> between two words can be calculated from the topics the two share,
> which is the sum over topics (say 10 topics) of the conditional
> probability of word one given topic z multiplied by the conditional
> probability of topic z given word two.
> P(w1|w2)=SUM (z=1 to 10) [P(w1|z)P(z|w2)]

The above looks to be the similarity of 2 words according to the LDA topic models. It may be interesting to compare this with a P(w1|w2) calculated directly from the documents themselves: (# of documents with both w1 and w2)/(# of documents with w2)
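The document-count estimate just described can be sketched as follows. This is a minimal illustration that assumes each document is already tokenised into a list of words; the function name `cooccurrence_estimate` and the toy documents are hypothetical.

```python
def cooccurrence_estimate(docs, w1, w2):
    """(# of documents with both w1 and w2) / (# of documents with w2).

    docs is a list of token lists. Returns None if no document contains w2.
    """
    doc_sets = [set(tokens) for tokens in docs]
    with_w2 = [s for s in doc_sets if w2 in s]
    if not with_w2:
        return None
    both = sum(1 for s in with_w2 if w1 in s)
    return both / len(with_w2)


# Toy corpus (hypothetical).
docs = [
    ["topic", "model", "word"],
    ["topic", "corpus"],
    ["word", "corpus"],
    ["topic", "word"],
]
estimate = cooccurrence_estimate(docs, "topic", "word")
print(estimate)
```

Note that this counts document-level co-occurrence, whereas the LDA-based quantity is defined over word probabilities, so the two numbers are comparable only as rough indicators.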


> The question is how to calculate the similarity of two words in
> particular documents (we know how much the documents are about topics).
> I was thinking of taking the topic proportion of documents as weights,
> multiply in the topics given their weights and work out the above
> mentioned math. Is what I am trying to achieve mathematically correct?
> Regards,
> Yashar
