[Corpora-List] Why is there a log in IDF of TF-IDF?

Andrew Caines cainesap at gmail.com
Fri Jun 1 15:39:13 CEST 2018


Thanks Rafael. Here's a new URL for this paper (assuming this is the right one?), as yours was specific to you:

https://dl.acm.org/citation.cfm?id=1390409

Andrew

On 1 June 2018 at 11:43, Rafael E. Banchs <rembanchs at i2r.a-star.edu.sg> wrote:


> This may help the discussion…
>
>
>
> SIGIR 2008 paper giving interesting probabilistic insights on TF-IDF:
> http://delivery.acm.org/10.1145/1400000/1390409/p435-roelleke.pdf?ip=192.122.131.36&id=1390409&acc=ACTIVE%20SERVICE&key=FF6731C4D3E3CFFF%2E93CCAFF1814A016F%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35&__acm__=1527871192_7c7f910a516290c7852b5934f7f8e870
>
>
>
> Enjoy!
>
>
>
>
>
> *From:* corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] *On Behalf Of* Koos Wilt
> *Sent:* Friday, 1 June, 2018 5:36 PM
> *To:* Bob Luk
> *Cc:* liling tan; tbaldwin at gmail.com; corpora
> *Subject:* Re: [Corpora-List] Why is there a log in IDF of TF-IDF?
>
>
>
> Again, the Great Koos brain, a fantastic contraption, holding people
> around the globe enthralled in awe, strikes again: It stretches, yawns,
> seems to be off to a slow start, it belches, and then, unexpectedly, spews
> forth a fraction of its amazing knowledge.
>
>
>
> http://searchivarius.org/blog/tf-idf-simply-cross-entropy
>
>
>
> Send money.
>
>
>
> Seriously, reading this kind of stuff will deepen your understanding of
> what is really going on. NLP formulae are full of equivalences, the best
> known being push-down automata and context-free languages. But
> Kullback-Leibler and Multinomial Bayes have also been suggested to be the
> same.
>
>
>
> -K
>
>
>
>
>
>
>
>
>
>
>
> 2018-06-01 10:39 GMT+02:00 Koos Wilt <kooswilt at gmail.com>:
>
> Will look it up. Thanks.
>
>
>
> -K
>
>
>
> 2018-06-01 10:38 GMT+02:00 Bob Luk <csrluk at gmail.com>:
>
> Are you sure it is a cross entropy? You need to sum over all x in
> CrossEntropy(p, q) = -SUM_x p(x) log q(x). Summing over all x would mean
> over all words in the document, not over all words in the query, since the
> tf is the tf in the document.
>
>
>
> Cheers,
>
>
>
> Robert Luk
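
The distinction raised above can be checked numerically. The following is a minimal sketch under stated assumptions, not the derivation from the linked blog post: the toy collection, the query, and all variable names are hypothetical; p(x) is taken to be the relative term frequency in one document and q(x) the document-frequency ratio df(x)/N, which is not a normalized probability distribution. Under those assumptions, the cross-entropy sum over all document terms equals the sum of (relative tf) * log(N/df) over those same terms, while a sum restricted to query terms covers only part of it.

```python
import math
from collections import Counter


def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log q(x), summed over every x with p(x) > 0."""
    return -sum(px * math.log(q[x]) for x, px in p.items() if px > 0)


# Hypothetical toy collection: each document is a list of tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "a cat and a dog".split(),
    "the log of the idf".split(),
]
n_docs = len(docs)

# Document frequency: number of documents containing each term.
df = Counter(term for d in docs for term in set(d))

doc = docs[0]
tf = Counter(doc)

# p(x): relative term frequency within the chosen document.
p = {t: c / len(doc) for t, c in tf.items()}
# q(x): document-frequency ratio (assumption; does not sum to 1).
q = {t: df[t] / n_docs for t in tf}

# Sum over ALL terms of the document.
full = cross_entropy(p, q)
tfidf_all_terms = sum(p[t] * math.log(n_docs / df[t]) for t in tf)

# Sum over the terms of a hypothetical query only.
query = ["cat", "mat"]
tfidf_query_only = sum(p[t] * math.log(n_docs / df[t]) for t in query if t in tf)

print(f"cross-entropy over all document terms: {full:.4f}")
print(f"tf*idf summed over all document terms: {tfidf_all_terms:.4f}")
print(f"tf*idf summed over query terms only:   {tfidf_query_only:.4f}")
```

The first two numbers agree because -log(df/N) is the same quantity as log(N/df); the third is smaller because every omitted term contributes a non-negative amount.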
>
>
>
> On Fri, Jun 1, 2018 at 4:25 PM, Koos Wilt <kooswilt at gmail.com> wrote:
>
> Just FYI and making conversation: did you guys know tf*idf is equivalent
> to Shannon's cross-entropy?
>
>
>
> -K
>
>
>
> 2018-06-01 6:28 GMT+02:00 <tbaldwin at gmail.com>:
>
> First, there is no canonical TF-IDF formulation, and rather TF-IDF is a
> family of methods based around a set of intuitions involving TF and DF.
> But yes, you are correct that one of the standard implementations logs the
> IDF (incl. in BM25), as a means of (monotonically) down-scaling the IDF
> factor relative to the TF. Otherwise for large document collections,
> singleton terms absolutely dominate the calculation. There is usually also
> some additive smoothing of the DF to avoid high DF terms (in all
> documents) getting a weight of 0.
>
>
> Tim
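
To make the scaling argument above concrete, here is a small sketch comparing an unlogged IDF with a logged one, plus one additive-smoothing variant. The function names, the collection size, and the particular smoothing formula are assumptions chosen for illustration; as noted above, there is no single canonical formulation.

```python
import math


def raw_idf(n_docs, df):
    """Unlogged inverse document frequency: N / df."""
    return n_docs / df


def log_idf(n_docs, df):
    """Logged IDF: log(N / df). A term found in every document gets 0."""
    return math.log(n_docs / df)


def smoothed_log_idf(n_docs, df):
    """One additive-smoothing variant (an assumption, not the only choice):
    log((N + 1) / (df + 1)) + 1, so terms in every document keep a weight > 0."""
    return math.log((n_docs + 1) / (df + 1)) + 1.0


N_DOCS = 1_000_000  # hypothetical collection size

print(f"{'df':>9} {'N/df':>12} {'log(N/df)':>10} {'smoothed':>9}")
for df in (1, 100, 10_000, N_DOCS):
    print(
        f"{df:>9} {raw_idf(N_DOCS, df):>12.1f} "
        f"{log_idf(N_DOCS, df):>10.3f} {smoothed_log_idf(N_DOCS, df):>9.3f}"
    )
```

Without the log, a singleton term (df = 1) outweighs a term with df = 10,000 by a factor of 10,000; with the log, the ratio between their weights is 3, which keeps the IDF factor on a scale comparable to typical TF values.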
>
> On Fri, 2018-06-01 at 10:20 +0800, liling tan wrote:
> > Dear All,
> >
> > Anyone care to answer the question of why there is a log in the IDF of
> > TF-IDF?
> >
> > Regards,
> > Liling
> > _______________________________________________
> > UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> > Corpora mailing list
> > Corpora at uib.no
> > https://mailman.uib.no/listinfo/corpora
>
>
>
>
>
>
>
>
>
>
>
>
>
>


