[Corpora-List] Why is there a log in IDF of TF-IDF?

Bob Luk csrluk at gmail.com
Fri Jun 1 10:38:23 CEST 2018


Are you sure it is a cross-entropy? Cross-entropy requires a sum over all x: H(p, q) = -SUM_x p(x) log q(x). Here "all x" would mean all words in the document, not just the words in the query, since the tf is the term frequency in the document.
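To make the point about the range of the sum concrete, here is a minimal sketch in Python. The toy vocabulary, the two distributions p and q, and the one-word query are all hypothetical values chosen for illustration; they do not come from any real collection.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum over ALL x of p(x) * log q(x)."""
    return -sum(px * math.log(q[x]) for x, px in p.items() if px > 0)

# Hypothetical toy distributions over a three-word vocabulary.
p = {"the": 0.5, "cat": 0.25, "sat": 0.25}  # e.g. relative tf in one document
q = {"the": 0.25, "cat": 0.25, "sat": 0.5}  # e.g. a collection-level model

# Sum over every word in the document: a proper cross-entropy.
full_sum = cross_entropy(p, q)

# Sum restricted to the query terms only: a different, smaller quantity.
query = ["cat"]
partial_sum = -sum(p[w] * math.log(q[w]) for w in query)

print(full_sum, partial_sum)
```

The two numbers differ because the second sum drops every document word that is not in the query, which is the gap between a query-term tf*idf score and a cross-entropy.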

Cheers,

Robert Luk

On Fri, Jun 1, 2018 at 4:25 PM, Koos Wilt <kooswilt at gmail.com> wrote:


> Just FYI and making conversation: did you guys know tf*idf is equivalent
> to Shannon's cross-entropy?
>
> -K
>
> 2018-06-01 6:28 GMT+02:00 <tbaldwin at gmail.com>:
>
>> First, there is no canonical TF-IDF formulation; rather, TF-IDF is a family
>> of methods based around a set of intuitions involving TF and DF. But yes, you
>> are correct that one of the standard implementations logs the IDF (including
>> in BM25), as a means of (monotonically) down-scaling the IDF factor relative
>> to the TF. Otherwise, for large document collections, singleton terms
>> absolutely dominate the calculation. There is usually also some additive
>> smoothing of the DF to avoid high-DF terms (those occurring in all documents)
>> getting a weight of 0.
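The down-scaling effect described above can be sketched numerically. This is only an illustration: the collection size N and the document frequencies are made-up values, and smoothed_log_idf shows just one of several additive-smoothing variants in use, not necessarily the one meant here.

```python
import math

def raw_idf(n_docs, df):
    """Unlogged inverse document frequency: N / df."""
    return n_docs / df

def log_idf(n_docs, df):
    """Logged IDF: log(N / df); equals 0 when a term is in every document."""
    return math.log(n_docs / df)

def smoothed_log_idf(n_docs, df):
    """One possible smoothed variant (an assumption): log(1 + N / df)."""
    return math.log(1 + n_docs / df)

N = 1_000_000  # hypothetical collection size
for df in (1, 1_000, N):
    print(df, raw_idf(N, df), log_idf(N, df), smoothed_log_idf(N, df))
```

With these values a singleton term outweighs a term with df = 1000 by a factor of 1000 under the raw IDF, but only by a factor of about 2 under the logged IDF. A term that occurs in every document gets a logged weight of exactly 0, while the smoothed variant keeps it slightly above 0.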
>>
>>
>> Tim
>>
>> On Fri, 2018-06-01 at 10:20 +0800, liling tan wrote:
>> > Dear All,
>> >
>> > Anyone care to answer the question of why there is a log in the IDF of
>> > TF-IDF?
>> >
>> > Regards,
>> > Liling
>> > _______________________________________________
>> > UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> > Corpora mailing list
>> > Corpora at uib.no
>> > https://mailman.uib.no/listinfo/corpora
>>