[Corpora-List] Why is there a log in IDF of TF-IDF?

Koos Wilt kooswilt at gmail.com
Fri Jun 1 10:39:08 CEST 2018


Will look it up. Thanks.

-K

2018-06-01 10:38 GMT+02:00 Bob Luk <csrluk at gmail.com>:


> Are you sure it is a cross-entropy? You need to sum over all x in
> H(p, q) = -SUM_x p(x) log q(x). "For all x" would mean for all words in
> the documents, not for all words in the query, since the tf is the tf in
> the document.
>
> Cheers,
>
> Robert Luk
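
Robert's objection can be made concrete with a small sketch. Everything in it is a hypothetical toy setup, not something taken from the thread: the document, the document frequencies, and the choice of p(x) as relative tf and q(x) as df/N (which is not a normalised distribution). It only shows that a cross-entropy-style sum runs over every word of the document, whereas a query score adds up the tf*idf terms of the query words alone.

```python
import math
from collections import Counter


def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log q(x), over every x with p(x) > 0."""
    return -sum(px * math.log(q[x]) for x, px in p.items() if px > 0)


# Hypothetical toy data: one document and made-up document frequencies.
doc = "the cat sat on the mat".split()
df = {"the": 4, "cat": 1, "sat": 2, "on": 3, "mat": 1}
n_docs = 4

tf = Counter(doc)
p = {w: c / len(doc) for w, c in tf.items()}  # relative tf in the document
q = {w: df[w] / n_docs for w in tf}           # share of documents containing w

# Sum over ALL words of the document (the cross-entropy-style quantity).
all_words = cross_entropy(p, q)

# Sum over the query words only (what a tf*idf query score adds up).
query = {"cat", "mat"}
query_only = sum(p[w] * math.log(n_docs / df[w]) for w in query)

print(all_words, query_only)
```

The two sums coincide only when the query happens to contain every document word that has a non-zero idf.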
>
> On Fri, Jun 1, 2018 at 4:25 PM, Koos Wilt <kooswilt at gmail.com> wrote:
>
>> Just FYI and making conversation: did you guys know tf*idf is equivalent
>> to Shannon's cross-entropy?
>>
>> -K
>>
>> 2018-06-01 6:28 GMT+02:00 <tbaldwin at gmail.com>:
>>
>>> First, there is no canonical TF-IDF formulation; rather, TF-IDF is a
>>> family of methods based around a set of intuitions involving TF and DF.
>>> But yes, you are correct that one of the standard implementations logs
>>> the IDF (including in BM25), as a means of (monotonically) down-scaling
>>> the IDF factor relative to the TF. Otherwise, for large document
>>> collections, singleton terms absolutely dominate the calculation. There
>>> is usually also some additive smoothing of the DF to avoid high-DF terms
>>> (those in all documents) getting a weight of 0.
>>>
>>>
>>> Tim
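
A minimal sketch of Tim's point, under stated assumptions: the collection size N and the document frequencies are hypothetical, and the smoothed formula log((1 + N) / (1 + df)) + 1 is just one common additive-smoothing variant, not necessarily the one Tim has in mind. It compares the unlogged ratio N/df with its logarithm for a singleton term, a mid-frequency term, and a term that occurs in every document.

```python
import math


def raw_idf(n_docs, df):
    """Unlogged inverse document frequency: N / df."""
    return n_docs / df


def log_idf(n_docs, df):
    """Logged IDF: log(N / df); zero for a term found in every document."""
    return math.log(n_docs / df)


def smoothed_log_idf(n_docs, df):
    """Assumed smoothing variant: log((1 + N) / (1 + df)) + 1, never zero."""
    return math.log((1 + n_docs) / (1 + df)) + 1


N = 1_000_000  # hypothetical collection size
for df in (1, 1_000, N):
    print(df,
          raw_idf(N, df),
          round(log_idf(N, df), 3),
          round(smoothed_log_idf(N, df), 3))
```

Without the log, the singleton term outweighs the df = 1,000 term by a factor of 1,000; with the log the factor is 2, so the TF part of the score still has a say.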
>>>
>>> On Fri, 2018-06-01 at 10:20 +0800, liling tan wrote:
>>> > Dear All,
>>> >
>>> > Anyone care to answer the question of why is there a log in IDF of
>>> > TF-IDF?
>>> >
>>> > Regards,
>>> > Liling
>>> > _______________________________________________
>>> > UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>> > Corpora mailing list
>>> > Corpora at uib.no
>>> > https://mailman.uib.no/listinfo/corpora
>>>
>>
>>
>