[Corpora-List] Corpus size and accuracy of frequency listings

Mark Davies Mark_Davies
Fri Apr 3 16:45:35 CEST 2009



> Dear Mark,
> I don't think your question makes much sense -- possibly because you fail to explain what is the purpose of your frequency lists.

No, I didn't give all of the relevant details in the first message. The main issue is what counts as an "adequate" corpus size for creating a lemma list of X words in a given language. If it's a top 10,000 lemma list, is 10,000,000 words adequate? Is 100,000,000 much better? The main point -- is it worth the effort to create a corpus ten times the size for only a small increase in accuracy? And I'm not just asking for the sake of curiosity -- there's an upcoming project that needs some data on this.


>> The effect of picking every 5th or 50th running word on the ranked list...

It would be every 5th or 50th word of running text *in the corpus*, *not* the ranked list. In this way, even words that occur mainly in multiword expressions should be fine. Adjacent words X1 and X2 would each be counted just like any other word. Sometimes the first word would be retrieved (taking every 10th word as an example, we would take words 1, 11, 21, 31, etc.), and sometimes it would be the second word. It would never take the whole multiword expression together, of course, but then we're just after 1-grams for the lemma list (unless we *want* to preserve multiword units in the list, as in earlier versions of the BNC, for example).
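To make the sampling concrete, here is a minimal sketch in Python. The function names, the toy sentence, and the step of 2 are all just illustrative assumptions (a real test would use a step of 5 or 50 over a lemmatized corpus); the only point is that the step is applied to the running text, so each half of a multiword expression like "of course" still gets picked up on some occurrences.

```python
from collections import Counter

def sample_every_nth(tokens, step, offset=0):
    """Return every `step`-th token of running text, starting at index `offset`."""
    return tokens[offset::step]

def ranked_list(tokens, top_n):
    """Return the `top_n` most frequent tokens, most frequent first."""
    return [word for word, _ in Counter(tokens).most_common(top_n)]

# Toy "corpus" (hypothetical); assumed to be tokenized and lemmatized already.
tokens = "of course the cat sat on the mat and of course the dog sat too".split()

# Systematic sample of the running text, not of the ranked list.
sample = sample_every_nth(tokens, step=2)

print(sample)
print(ranked_list(sample, top_n=3))
```

In this toy sample, "of" is retrieved from the first occurrence of "of course" and "course" from the second, so neither word is systematically lost.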

And again, I'm not proposing to actually reduce a 100 million word corpus down to a 10 million word corpus -- that wouldn't make any sense. The point is whether -- for a ranked lemma list of size X -- a 10 million word corpus, for example, might be nearly as adequate as a 100 million word corpus (all other things -- genres, etc -- being equal).
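One possible way to put a number on "nearly as adequate" (this is only my illustrative assumption, not an established measure) would be to take the top-X list from the larger corpus as the reference and see what share of its entries also make the top-X list from the smaller corpus. A rough sketch, with hypothetical function names and made-up toy data:

```python
from collections import Counter

def top_n_set(tokens, n):
    """Return the set of the n most frequent items in `tokens`."""
    return {word for word, _ in Counter(tokens).most_common(n)}

def top_n_overlap(reference_tokens, smaller_tokens, n):
    """Share of the reference top-n list that also appears in the smaller corpus's top-n list."""
    reference = top_n_set(reference_tokens, n)
    smaller = top_n_set(smaller_tokens, n)
    if not reference:
        return 0.0
    return len(reference & smaller) / len(reference)

# Made-up toy data standing in for lemmatized corpora of different sizes.
larger = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 1
smaller = ["a"] * 3 + ["b"] * 2 + ["d"] * 2 + ["c"] * 1

print(top_n_overlap(larger, smaller, n=3))
```

Set overlap ignores differences in rank order within the list; a rank correlation over the shared items would be a stricter comparison if the ordering matters for the project.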

Mark D.

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906

http://davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================


