[Corpora-List] Why is there a log in IDF of TF-IDF?

Rafael E. Banchs rembanchs at i2r.a-star.edu.sg
Fri Jun 1 12:43:51 CEST 2018


This may help the discussion…

SIGIR 2008 paper giving interesting probabilistic insights on TF-IDF: http://delivery.acm.org/10.1145/1400000/1390409/p435-roelleke.pdf?ip=192.122.131.36&id=1390409&acc=ACTIVE%20SERVICE&key=FF6731C4D3E3CFFF%2E93CCAFF1814A016F%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&__acm__=1527871192_7c7f910a516290c7852b5934f7f8e870

Enjoy!

From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Koos Wilt
Sent: Friday, 1 June, 2018 5:36 PM
To: Bob Luk
Cc: liling tan; tbaldwin at gmail.com; corpora
Subject: Re: [Corpora-List] Why is there a log in IDF of TF-IDF?

Again, the Great Koos brain, a fantastic contraption, holding people around the globe enthralled in awe, strikes again: It stretches, yawns, seems to be off to a slow start, it belches, and then, unexpectedly, spews forth a fraction of its amazing knowledge.

http://searchivarius.org/blog/tf-idf-simply-cross-entropy

Send money.

Seriously, reading this kind of stuff will deepen your understanding of what is really going on. NLP formulae are full of equivalences, the best known being the one between push-down automata and context-free languages. But Kullback-Leibler and Multinomial Bayes have also been suggested to be the same.

-K

2018-06-01 10:39 GMT+02:00 Koos Wilt <kooswilt at gmail.com>:
Will look it up. Thanks.

-K

2018-06-01 10:38 GMT+02:00 Bob Luk <csrluk at gmail.com>:
Are you sure it is a cross entropy? You need to sum over all x: CrossEntropy(p, q) = -SUM_x p(x) log q(x). "For all x" would mean all words in the document, not all words in the query, since the tf is the tf in the document.

Cheers,

Robert Luk
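
To make the summation question concrete, here is a minimal sketch comparing a summed tf*idf score with an expression that has the cross-entropy form -SUM_x p(x) log q(x). It assumes p(x) = tf(x) / (total term count) and q(x) = df(x) / N; the function names and the toy counts are made up for illustration and are not taken from the blog post or from any library. Note that q(x) defined this way does not sum to 1 over the vocabulary, so this only shows a similarity in form, not a strict equivalence.

```python
import math
from collections import Counter


def tfidf_sum(terms, df, n_docs):
    """Sum of tf(t) * log(N / df(t)) over the distinct terms in `terms`."""
    tf = Counter(terms)
    return sum(count * math.log(n_docs / df[t]) for t, count in tf.items())


def cross_entropy_form(terms, df, n_docs):
    """-SUM_x p(x) log q(x), assuming p(x) = tf(x)/len(terms), q(x) = df(x)/N."""
    tf = Counter(terms)
    total = sum(tf.values())
    return -sum((count / total) * math.log(df[t] / n_docs)
                for t, count in tf.items())


# Hypothetical toy data: three tokens, document frequencies out of N = 1000.
terms = ["log", "idf", "log"]
df = {"log": 10, "idf": 100}
n_docs = 1000

score = tfidf_sum(terms, df, n_docs)
entropy_like = cross_entropy_form(terms, df, n_docs)
# Under these assumptions the two differ only by the factor len(terms).
print(score, len(terms) * entropy_like)
```

Which text supplies the tf counts (document or query) decides what "all x" ranges over; terms with tf = 0 contribute nothing to either sum.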

On Fri, Jun 1, 2018 at 4:25 PM, Koos Wilt <kooswilt at gmail.com> wrote:
Just FYI and making conversation: did you guys know tf*idf is equivalent to Shannon's cross-entropy?

-K

2018-06-01 6:28 GMT+02:00 <tbaldwin at gmail.com>:
First, there is no canonical TF-IDF formulation; rather, TF-IDF is a family of methods built around a set of intuitions involving TF and DF. But yes, you are correct that one of the standard implementations takes the log of the IDF (including in BM25), as a means of (monotonically) down-scaling the IDF factor relative to the TF. Otherwise, for large document collections, singleton terms absolutely dominate the calculation. There is usually also some additive smoothing of the DF, to avoid high-DF terms (those occurring in all documents) getting a weight of 0.

Tim
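
The down-scaling and smoothing described above can be sketched numerically. This is only an illustration under stated assumptions: the function names are hypothetical, the collection size is invented, and log(1 + N/df) is just one of several smoothing variants in use, not necessarily the one meant above.

```python
import math


def raw_idf(n_docs, df):
    """Inverse document frequency with no log: N / df."""
    return n_docs / df


def log_idf(n_docs, df):
    """Logged IDF: log(N / df). Equals 0 when the term is in every document."""
    return math.log(n_docs / df)


def smoothed_log_idf(n_docs, df):
    """One possible smoothed variant (an assumption): log(1 + N / df)."""
    return math.log(1 + n_docs / df)


n_docs = 1_000_000  # invented collection size

# A singleton term (df = 1) versus a moderately rare term (df = 1000).
raw_ratio = raw_idf(n_docs, 1) / raw_idf(n_docs, 1000)
log_ratio = log_idf(n_docs, 1) / log_idf(n_docs, 1000)
print(raw_ratio, log_ratio)

# A term occurring in all documents: weight 0 unsmoothed, positive smoothed.
print(log_idf(n_docs, n_docs), smoothed_log_idf(n_docs, n_docs))
```

Without the log, the singleton outweighs the df = 1000 term by a factor of about 1000; with the log, the factor is about 2, which leaves the TF component room to matter.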

On Fri, 2018-06-01 at 10:20 +0800, liling tan wrote:
> Dear All,
>
> Anyone care to answer the question of why is there a log in IDF of TF-IDF?
>
> Regards,
> Liling
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> https://mailman.uib.no/listinfo/corpora

