[Corpora-List] Questions about t-score

Stefan Evert stefan.evert
Thu Apr 9 19:02:03 CEST 2009



> I am writing to know whether the t-score used in corpus
> analysis is the same t-score used in regular statistics.

I'm afraid so, which means that it's entirely inapplicable to corpus data and the resulting p-values cannot be interpreted in any meaningful way. I complain about this at length here:

http://www.collocations.de/AM/section4.html#s4.1


> That is, if I am, for example, looking for the collocation "wing"
> and "angel," and I find that these two words occur together 75 times
> with a t-score value of 4.3, can I say that the df (degree of
> freedom) is 75-1=74, and then go to the t-score table and try to
> find whether my result is statistically significant, i.e. p<0.05?

No, because the assumptions made by the test are so far off the mark in this case that the test statistic doesn't even remotely follow a t distribution. Empirical results and simulation experiments show that t-score drastically underestimates significance (i.e. its p-values are much higher than those of the mathematically appropriate Fisher exact test). This behaviour is often desirable in the context of collocation extraction, which accounts for the popularity of t-score.
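For readers who want to see where a value like 4.3 comes from, here is a minimal sketch of the t-score as it is usually defined in the collocation literature, t = (O - E) / sqrt(O), with expected frequency E = f1 * f2 / N under independence. The formula is not spelled out in this message, so take it as an assumption. Only the co-occurrence count of 75 comes from the question; the marginal frequencies and corpus size are hypothetical and were picked so that the score lands near 4.3.

```python
import math

def t_score(o11, f1, f2, n):
    """t-score association measure: (O - E) / sqrt(O).

    o11 = observed co-occurrence frequency of the word pair
    f1, f2 = marginal frequencies of the two words
    n = sample size (number of pair tokens in the corpus)
    """
    expected = f1 * f2 / n  # expected frequency under independence
    return (o11 - expected) / math.sqrt(o11)

# Hypothetical counts: only o11 = 75 is taken from the question.
score = t_score(o11=75, f1=4720, f2=8000, n=1_000_000)
print(round(score, 2))
```

Note that nothing in this calculation produces a sample of 75 independent measurements, which is one way to see why df = 74 has no justification here.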

If you really want to calculate p-values, you should use Fisher's test on 2x2 contingency tables. You'll find, though, that most word pairs appear to be significant with p < .05 (and even quite often p < .001).
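As a rough illustration of what that involves, the following sketch computes a one-sided Fisher p-value directly from the hypergeometric distribution, using only the Python standard library. The function names are made up for this example. The observed count of 75 is from the question, while the marginal frequencies and corpus size are hypothetical.

```python
import math

def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def fisher_one_sided(o11, f1, f2, n):
    """One-sided Fisher exact p-value for a 2x2 contingency table.

    Returns P(X >= o11), where X follows the hypergeometric distribution
    determined by the row total f1, the column total f2 and sample size n.
    """
    log_denom = log_choose(n, f2)
    total = 0.0
    # Sum the probabilities of all tables at least as extreme as the observed one.
    for k in range(o11, min(f1, f2) + 1):
        log_p = log_choose(f1, k) + log_choose(n - f1, f2 - k) - log_denom
        total += math.exp(log_p)
    return total

# Hypothetical counts: only o11 = 75 is taken from the question.
p_value = fisher_one_sided(o11=75, f1=4720, f2=8000, n=1_000_000)
print(f"one-sided Fisher p-value: {p_value:.3g}")
```

Working in log space avoids overflow in the binomial coefficients for corpus-sized counts. With these made-up counts the p-value comes out far below .001.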

I cannot resist a little bit of self-promotion: you might want to look at my PhD thesis

Evert, Stefan (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. Dissertation, Institut für maschinelle Sprachverarbeitung, University of Stuttgart. Published in 2005, URN urn:nbn:de:bsz:93-opus-23714 .

or this handbook chapter

Evert, Stefan (2008). Corpora and collocations. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, chapter 58. Mouton de Gruyter, Berlin.

which have extensive discussions of statistical measures of association. Both can be downloaded from my homepage (see below).

Best regards, Stefan Evert

[ stefan.evert at uos.de | http://purl.org/stefan.evert ]


