[Corpora-List] Corpus size and accuracy of frequency listings

Diana Santos Diana.Santos
Fri Apr 3 11:29:45 CEST 2009

Dear Mark, I don't think your question makes much sense -- possibly because you fail to explain what the purpose of your frequency lists is. I would expect that, depending on the goal, completely different answers would be possible and would make sense.

Apparently, you are concerned with ranking (not the actual frequency numbers but the order in the list). Is this right?

But what would be the purpose (or usefulness) of selecting every fifth occurrence of a word in a corpus? What linguistic function would that have? Certainly not a computational one: we no longer live in a time when we have to save computing power on counting :-)

I would strongly suggest that, if you want to reduce your corpus to a fifth, you still keep utterances that make sense -- that is, you should keep every fifth sentence of your corpus, not every fifth word.
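The difference between the two sampling schemes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy corpus and the function names (sample_every_nth_word, sample_every_nth_sentence, top_k, overlap) are invented here, plain word forms stand in for lemmata, and the overlap measure is just one possible way of comparing two ranked lists.

```python
from collections import Counter


def sample_every_nth_word(sentences, n):
    """Flatten the corpus and keep every n-th running word."""
    words = [w for sent in sentences for w in sent]
    return words[::n]


def sample_every_nth_sentence(sentences, n):
    """Keep every n-th sentence intact, then flatten it to words."""
    return [w for sent in sentences[::n] for w in sent]


def top_k(words, k):
    """Set of the k most frequent word forms (ties broken by first occurrence)."""
    return {w for w, _ in Counter(words).most_common(k)}


def overlap(full, sample, k):
    """Fraction of the full corpus's top-k items that are also in the sample's top-k."""
    reference = top_k(full, k)
    return len(reference & top_k(sample, k)) / len(reference)


# Hypothetical toy corpus: 10 short, pre-tokenized sentences (30 running words).
sentences = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["a", "cat", "ran"],
    ["the", "cat", "slept"],
    ["a", "dog", "sat"],
] * 2

full = [w for sent in sentences for w in sent]
by_word = sample_every_nth_word(sentences, 5)
by_sentence = sample_every_nth_sentence(sentences, 5)

print("every 5th word:    ", by_word, overlap(full, by_word, 3))
print("every 5th sentence:", by_sentence, overlap(full, by_sentence, 3))
```

Both samples have the same size, but only the sentence-based one still consists of utterances that can be read, tagged and lemmatized in context. On a real corpus one would run the same comparison for several values of k (100, 10,000, 20,000) and several sampling rates.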

Also, the notion of word is quite fluid -- to say the least. So if you are working with lemmata (?), is "in spite of" one word, or three? Is "Mark Davies" one word, or two? I suppose you first lemmatize your corpus, then select... but, as you may be aware, these kinds of decisions have an enormous impact. See Santos et al. (2003) for a detailed presentation of differences in tokenization (not even lemmatization!) between the different groups in Morfolimpíadas (for Portuguese), together with quantitative data.
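To see how much a tokenization decision alone can move the counts, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the list MULTIWORD_UNITS, the helper merge_multiword and the example sentence are invented, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

# Hypothetical list of multiword units to be treated as single tokens.
MULTIWORD_UNITS = [("in", "spite", "of"), ("Mark", "Davies")]


def merge_multiword(tokens, units=MULTIWORD_UNITS):
    """Join each listed multiword unit into one token, scanning left to right."""
    out, i = [], 0
    while i < len(tokens):
        for unit in units:
            if tuple(tokens[i:i + len(unit)]) == unit:
                out.append(" ".join(unit))
                i += len(unit)
                break
        else:
            # No unit starts at position i: keep the single token.
            out.append(tokens[i])
            i += 1
    return out


text = "Mark Davies went out in spite of the rain and in spite of the cold"
naive = text.split()             # every whitespace-separated form is a token
merged = merge_multiword(naive)  # multiword units count as one token each

naive_counts = Counter(naive)
merged_counts = Counter(merged)

print(len(naive), "tokens without merging;", len(merged), "tokens with merging")
print("count of 'in':", naive_counts["in"], "vs", merged_counts["in"])
print("count of 'in spite of':", merged_counts["in spite of"])
```

Both the corpus size and the frequencies of very common items such as "in" and "of" change with this single decision, so two frequency lists are only comparable if they share the same tokenization.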

In any case, depending on why you want the frequency lists, I would suggest different ways to approach or model your problem. Can you be more specific?

These references (for completely different purposes) may also help you:

Katz, Slava M. 1996. "Distribution of content words and phrases in text and language modelling". Natural Language Engineering 2, pp. 15-59.

Berber Sardinha, Tony. 2000. "Comparing corpora with WordSmith Tools: How large must the reference corpus be?", in Adam Kilgarriff & Tony Berber Sardinha (eds.), Proceedings of The Workshop on Comparing Corpora, Held in conjunction with The 38th Annual Meeting of the Association for Computational Linguistics, 7 October 2000, Hong Kong University of Science and Technology (HKUST), Hong Kong, http://acl.eldoc.ub.rug.nl/mirror/W/W00/W00-0902.pdf

Evert, Stefan. 2006. "How random is a corpus? The library metaphor". Zeitschrift für Anglistik und Amerikanistik 54 (2), pp. 177-190. http://purl.org/stefan.evert/PUB/Evert2006.pdf

Santos, Diana, Luís Costa & Paulo Rocha. 2003. "Cooperatively evaluating Portuguese morphology". In Nuno J. Mamede, Jorge Baptista, Isabel Trancoso & Maria das Graças Volpe Nunes (eds.), Computational Processing of the Portuguese Language: 6th International Workshop, PROPOR 2003, Faro, Portugal, June 2003. Berlin/Heidelberg: Springer Verlag, pp. 259-266. http://www.linguateca.pt/Diana/download/SantosCostaRochaPROPOR2003.pdf

For Portuguese, through the AC/DC project, we have very detailed frequency lists for 22 different corpora: for forms, for lemmata per PoS, and for lemmata irrespective of PoS. You may want to consult those as well, to get inspiration for your hypotheses:

www.linguateca.pt/ACDC/ (choose Frequência in the left-hand menu).

Hope to have been of some help,
Greetings,
Diana

> I'm looking for studies that have considered how corpus size
> affects the accuracy of word frequency listings.
> For example, suppose that one uses a 100 million word corpus
> and a good tagger/lemmatizer to generate a frequency listing
> of the top 10,000 lemmas in that corpus. If one were to then
> take just every fifth word or every fiftieth word in the
> running text of the 100 million word corpus (thus creating a
> 20 million or a 2 million word corpus), how much would this
> affect the top 10,000 lemma list? Obviously it's a function
> of the size of the frequency list as well -- things might not
> change much in terms of the top 100 lemmas in going from a 20
> million word to a 100 million word corpus, whereas they would
> change much more for a 20,000 lemma list. But that's
> precisely the type of data I'm looking for.
> Thanks in advance,
> Mark Davies
