[Corpora-List] Corpus size and accuracy of frequency listings

Miles Osborne miles
Thu Apr 2 15:14:27 CEST 2009

This depends to a large extent upon the nature of the data. For the head of the distribution, frequency estimates are likely to be consistent across a range of corpus sizes and samples (words like "the" are always common). The tail is likely to vary in non-trivial ways.

We actually looked at this problem a long time ago and found that for some words, as you see more data, the estimate gets monotonically better, taking the estimate from all of the data as the yardstick. But for other words (and I don't mean obscure ones) odd patterns happen.
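
To make that concrete, here is a minimal sketch of the kind of check I mean, assuming the corpus is already tokenised into a Python list. The function names are made up for illustration, and this is not the procedure used in the paper cited below: it just estimates a word's relative frequency on growing prefixes of the corpus and compares each estimate with the one from the full corpus.

```python
def relative_frequency(tokens, word):
    """Relative frequency of `word` in a list of tokens (0.0 if the list is empty)."""
    if not tokens:
        return 0.0
    return tokens.count(word) / len(tokens)


def convergence_errors(tokens, word, steps):
    """Absolute error of the estimate on growing prefixes of the corpus.

    The full-corpus estimate is the yardstick, so the last error is always 0.
    `steps` (an illustrative parameter) is the number of equally spaced prefixes.
    """
    target = relative_frequency(tokens, word)
    errors = []
    for i in range(1, steps + 1):
        prefix = tokens[: len(tokens) * i // steps]
        errors.append(abs(relative_frequency(prefix, word) - target))
    return errors


def is_monotone_nonincreasing(errors):
    """True if the error never grows as more data is seen."""
    return all(a >= b for a, b in zip(errors, errors[1:]))


# Toy data: "w" occurs in bursts at the start and the end of the corpus.
bursty = ["w", "w", "x", "x", "x", "x", "w", "w"]
errors = convergence_errors(bursty, "w", steps=4)
print(errors, is_monotone_nonincreasing(errors))
```

On the toy data the estimate is exactly right at the halfway point and then gets worse before it recovers, which is the sort of non-monotonic behaviour I have in mind. On a real corpus you would want many more steps and probably several orderings of the data.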


James Curran and Miles Osborne. A very very large corpus doesn't always yield reliable estimates. Joint CoNLL02 - Workshop on Very Large Corpora, Taipei, Taiwan. 2002. http://www.cogsci.ed.ac.uk/~osborne/convergence.ps.gz

