[Corpora-List] Log-likelihood (was: Re: Questions about t-score)

Emmanuel Prochasson emmanuel.prochasson
Sat Apr 25 16:08:23 CEST 2009

Stefan Evert wrote:
> I cannot resist a little bit of self-promotion: you might want to look
> at my PhD thesis
> Evert, Stefan (2004). The Statistics of Word Cooccurrences: Word Pairs
> and Collocations. Dissertation, Institut für maschinelle
> Sprachverarbeitung, University of Stuttgart. Published in 2005, URN urn:nbn:de:bsz:93-opus-23714.
> or this handbook chapter
> Evert, Stefan (2008). Corpora and collocations. In A. Lüdeling and M.
> Kytö (eds.), Corpus Linguistics. An International Handbook, chapter
> 58. Mouton de Gruyter, Berlin.
> which have extensive discussions of statistical measures of
> association. Both can be downloaded from my homepage (see below).
I read both of these documents with great interest, since I have been using association measures intensively. I have a question regarding the log-likelihood computed from a contingency table. In some cases, I obtain zero values for O_12 or O_21 (following your notation). The log-likelihood is then undefined, because log(O_12/E_12) (or log(O_21/E_21)) is undefined. However, a zero value for O_12 or O_21 is of great interest: it shows that the two tokens are strongly related, since one of them /never appears/ without the other.

How should such a situation be handled so that the score stays balanced and homogeneous? Most of the time, zero values are simply ignored (log(O_12/E_12) is simply replaced by 0), but I feel that a log-likelihood computed that way can no longer be interpreted correctly. Adding "jitter" to zero values does not seem wise either, since the log function decreases quickly between 1 and 0 (the choice of jitter will have a huge influence).
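To make the two options concrete, here is a minimal Python sketch. It assumes the usual form of the statistic, G2 = 2 * sum over cells of O_ij * log(O_ij / E_ij), with expected frequencies E_ij = R_i * C_j / N computed from the row and column totals. The function names and the example counts are hypothetical, chosen only for illustration, and the "skip zero cells" rule is just the convention described above, not a recommendation.

```python
import math


def expected(o11, o12, o21, o22):
    """Expected frequencies E_ij = R_i * C_j / N under independence.

    Assumes the table is not empty (N > 0).
    """
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22   # row totals
    c1, c2 = o11 + o21, o12 + o22   # column totals
    return (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)


def log_likelihood(o11, o12, o21, o22):
    """G2 = 2 * sum(O_ij * log(O_ij / E_ij)).

    Cells with O_ij == 0 are skipped, i.e. their term is replaced by 0
    (the "ignore zero values" convention; an assumption of this sketch).
    """
    obs = (o11, o12, o21, o22)
    exp = expected(*obs)
    total = 0.0
    for o, e in zip(obs, exp):
        if o > 0:
            total += o * math.log(o / e)
    return 2.0 * total


# Hypothetical table in which the first token never occurs without the second.
g_skip = log_likelihood(10, 0, 5, 985)

# Same table with a small jitter added to the empty cell.
g_jitter = log_likelihood(10, 1e-6, 5, 985)

print(g_skip, g_jitter)
```

Running both calls on one's own counts, with several jitter values, shows how much the choice of jitter moves the score for a given table.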

I would be interested in any pointers on how to handle these situations.


-- Emmanuel
