[Corpora-List] Corpus size and accuracy of frequency listings

Mark Davies
Thu Apr 2 00:52:50 CEST 2009

I'm looking for studies that have considered how corpus size affects the accuracy of word frequency listings.

For example, suppose that one uses a 100 million word corpus and a good tagger/lemmatizer to generate a frequency listing of the top 10,000 lemmas in that corpus. If one were to then take just every fifth word or every fiftieth word in the running text of the 100 million word corpus (thus creating a 20 million or a 2 million word corpus), how much would this affect the top 10,000 lemma list? Obviously it's a function of the size of the frequency list as well -- things might not change much in terms of the top 100 lemmas in going from a 20 million word to a 100 million word corpus, whereas they would change much more for a 20,000 lemma list. But that's precisely the type of data I'm looking for.
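As a rough illustration of the comparison described above, here is a minimal Python sketch. It assumes the corpus is already available as a list of lemmas in running-text order (i.e. the tagger/lemmatizer has been run). The function names `top_lemmas`, `subsample`, and `top_n_overlap` are made up for this example, and simple set overlap is only one possible measure of how much a list changes; rank correlation would be another.

```python
from collections import Counter


def top_lemmas(lemmas, n):
    """Return the n most frequent lemmas, most frequent first."""
    return [lemma for lemma, _ in Counter(lemmas).most_common(n)]


def subsample(lemmas, step):
    """Keep every step-th token of the running text (step=5 gives a corpus one fifth the size)."""
    return lemmas[::step]


def top_n_overlap(full, sample, n):
    """Fraction of the full corpus's top-n lemmas that also appear in the sample's top-n list."""
    full_top = set(top_lemmas(full, n))
    sample_top = set(top_lemmas(sample, n))
    if not full_top:
        return 0.0
    return len(full_top & sample_top) / len(full_top)
```

Running `top_n_overlap(corpus, subsample(corpus, 5), n)` for several values of `n` (say 100, 1,000, 10,000, 20,000) and again with a step of 50 would give one overlap figure per combination of list size and corpus size, which is the kind of table the question is asking about.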

Thanks in advance,

Mark Davies

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================
