[Corpora-List] Why is there a log in IDF of TF-IDF?

Koos Wilt kooswilt at gmail.com
Fri Jun 1 11:35:57 CEST 2018


Again, the Great Koos brain, a fantastic contraption, holding people around the globe enthralled in awe, strikes again: It stretches, yawns, seems to be off to a slow start, it belches, and then, unexpectedly, spews forth a fraction of its amazing knowledge.

http://searchivarius.org/blog/tf-idf-simply-cross-entropy

Send money.

Seriously, reading this kind of material will deepen your understanding of what is really going on. NLP formulae are full of equivalences, the best known being the one between push-down automata and context-free languages. But Kullback-Leibler and Multinomial Bayes have also been suggested to be the same.

-K
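
[Editorial sketch, not part of the original message.] Tim's point in the quoted thread below, that logging the IDF keeps singleton terms from dominating and that additive smoothing avoids a weight of 0, can be illustrated with a few lines of Python. The collection size, the document frequencies, and the function names are made up for illustration, and the smoothed formula is just one common variant, not something the thread specifies.

```python
import math


def idf_raw(n_docs, df):
    """Unlogged inverse document frequency: N / df."""
    return n_docs / df


def idf_log(n_docs, df):
    """Logged IDF: log(N / df). It is 0 for a term found in every document."""
    return math.log(n_docs / df)


def idf_smoothed(n_docs, df):
    """One common additive-smoothing variant (an assumption, not from the thread):
    log((1 + N) / (1 + df)) + 1, which stays positive even when df == N."""
    return math.log((1 + n_docs) / (1 + df)) + 1


# Hypothetical collection: one million documents.
N = 1_000_000
singleton_df = 1       # term that occurs in a single document
common_df = 100_000    # term that occurs in 10% of the documents

# How much more weight the singleton gets than the common term.
raw_ratio = idf_raw(N, singleton_df) / idf_raw(N, common_df)   # 100000.0
log_ratio = idf_log(N, singleton_df) / idf_log(N, common_df)   # about 6

print(f"raw IDF ratio: {raw_ratio:.0f}")
print(f"log IDF ratio: {log_ratio:.2f}")
print(f"log IDF when df == N:      {idf_log(N, N)}")
print(f"smoothed IDF when df == N: {idf_smoothed(N, N)}")
```

Without the log, the singleton outweighs the common term by five orders of magnitude, so no realistic TF difference could compensate; with the log, the gap shrinks to a small constant factor while the ordering of terms is preserved.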

2018-06-01 10:39 GMT+02:00 Koos Wilt <kooswilt at gmail.com>:


> Will look it up. Thanks.
>
> -K
>
> 2018-06-01 10:38 GMT+02:00 Bob Luk <csrluk at gmail.com>:
>
>> Are you sure it is a cross-entropy? You need to sum over all x in
>> CrossEntropy(p, q) = -SUM_x p(x) log q(x). "All x" would mean all words in
>> the document, not all words in the query, since the tf is the tf in the
>> document.
>>
>> Cheers,
>>
>> Robert Luk
>>
>> On Fri, Jun 1, 2018 at 4:25 PM, Koos Wilt <kooswilt at gmail.com> wrote:
>>
>>> Just FYI and making conversation: did you guys know tf*idf is equivalent
>>> to Shannon's cross-entropy?
>>>
>>> -K
>>>
>>> 2018-06-01 6:28 GMT+02:00 <tbaldwin at gmail.com>:
>>>
>>>> First, there is no canonical TF-IDF formulation; rather, TF-IDF is a family
>>>> of methods based around a set of intuitions involving TF and DF. But yes, you
>>>> are correct that one of the standard implementations logs the IDF (including
>>>> in BM25), as a means of (monotonically) down-scaling the IDF factor relative
>>>> to the TF. Otherwise, for large document collections, singleton terms
>>>> absolutely dominate the calculation. There is usually also some additive
>>>> smoothing of the DF to avoid high-DF terms (those in all documents) getting a
>>>> weight of 0.
>>>>
>>>>
>>>> Tim
>>>>
>>>> On Fri, 2018-06-01 at 10:20 +0800, liling tan wrote:
>>>> > Dear All,
>>>> >
>>>> > Anyone care to answer the question of why is there a log in IDF of
>>>> > TF-IDF?
>>>> >
>>>> > Regards,
>>>> > Liling
>>>> > _______________________________________________
>>>> > UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>>> > Corpora mailing list
>>>> > Corpora at uib.no
>>>> > https://mailman.uib.no/listinfo/corpora
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>