It seems to me that the question comes from a desire to quantify the error in the word probabilities inferred from a sample of the language. Or, conversely, to know how large one's corpus has to be before one can have a good degree of confidence that a ranked list of the 20,000 most frequent words calculated from it is sufficiently accurate for one's needs.
It is therefore a straightforward question of statistical significance, is it not?
Assuming it is, then the size of the corpus is the single fundamental factor, not any characteristic of the frequency distribution of the words. If we have a 10 million word corpus and we observe word X 10,000 times, and word Y only once, we have still made precisely 10 million observations with respect to each word (some negative and some positive), and so the dependability of both estimates is governed by the same sample size (i.e. the variance of each observed word frequency is a constant proportion of the true word frequency, with the constant depending only on the sample size). I might be wrong, but a little Monte Carlo experiment in Excel seemed to confirm this.
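For anyone who wants to repeat this kind of check outside a spreadsheet, here is a minimal Python sketch. It is not the Excel experiment mentioned above, only an illustration under the same assumption that every token is an independent random draw (a binomial model). The function names, corpus size and word probabilities are all made up for the example. Under that model the variance of an observed relative frequency is p(1-p)/N, so it is roughly proportional to p for rare words. The sketch also reports the standard deviation relative to p, which comes out at about 1/sqrt(Np) and is therefore larger for the rarer word, even though both estimates rest on the same N observations.

```python
import random
import statistics


def simulate_relative_freqs(p, corpus_size, trials, rng):
    """Observed relative frequency of one word in each simulated corpus.

    Assumes every token is an independent draw that is the word with
    probability p (binomial model; real corpora are not like this).
    """
    estimates = []
    for _ in range(trials):
        hits = sum(1 for _ in range(corpus_size) if rng.random() < p)
        estimates.append(hits / corpus_size)
    return estimates


def summarize(p, corpus_size, trials=200, seed=0):
    """Compare the simulated spread of the estimate with binomial theory."""
    rng = random.Random(seed)
    estimates = simulate_relative_freqs(p, corpus_size, trials, rng)
    sd = statistics.pstdev(estimates)
    return {
        "p": p,
        "sd": sd,                    # absolute spread of the estimate
        "relative_sd": sd / p,       # spread as a fraction of the true value
        "theory_sd": (p * (1 - p) / corpus_size) ** 0.5,
    }


if __name__ == "__main__":
    # Illustrative values only: one common word, one rare word.
    for p in (0.01, 0.0005):
        print(summarize(p, corpus_size=10_000))
```

The absolute spread shrinks as the word gets rarer while the relative spread grows, so which of the two counts as "dependability" depends on what the frequency list is for.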
All this assumes that your sample is truly random, of course, which, given the heterogeneity of your average corpus, is probably not true (or perhaps not even meaningful to try to achieve?). And knowing that it isn't doesn't help you very much. Maybe you could try to estimate the heterogeneity of the language, and therefore the reliability of this assumption, by looking at dispersion within the corpus. It's not immediately clear how you'd make use of that estimate if you did.
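One simple way to look at dispersion, offered only as a sketch, is to cut the corpus into equal-sized chunks, count a word in each chunk, and take the variance-to-mean ratio of those counts. If occurrences were scattered at random the ratio should be close to 1, and values well above 1 suggest the word clumps in particular stretches of text. The choice of this particular measure, the function names and the toy corpus are assumptions made for illustration, not something proposed in this thread.

```python
import statistics


def chunk_counts(tokens, word, n_chunks):
    """Count `word` in each of n_chunks equal-sized slices of `tokens`.

    Any leftover tokens at the end (when the length is not a multiple
    of n_chunks) are ignored.
    """
    size = len(tokens) // n_chunks
    return [tokens[i * size:(i + 1) * size].count(word) for i in range(n_chunks)]


def index_of_dispersion(counts):
    """Variance-to-mean ratio of per-chunk counts."""
    mean = statistics.mean(counts)
    if mean == 0:
        raise ValueError("word does not occur in any chunk")
    return statistics.pvariance(counts) / mean


if __name__ == "__main__":
    # Toy corpus of 8 stretches of 50 tokens: "the" occurs regularly
    # everywhere, "whale" occurs only in the third stretch.
    tokens = []
    for stretch in range(8):
        for i in range(50):
            if i % 5 == 0:
                tokens.append("the")
            elif stretch == 2 and i % 5 == 1:
                tokens.append("whale")
            else:
                tokens.append("x")

    for word in ("the", "whale"):
        counts = chunk_counts(tokens, word, n_chunks=8)
        print(word, counts, index_of_dispersion(counts))
```

In the toy data "the" is perfectly regular, so its ratio falls below 1, while "whale" comes out far above 1. The result depends on the chunk size chosen, which is one reason it is not obvious how to turn such a figure into a correction.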
It is also quite possible that I've missed the point entirely :-)
Justin Washtell
University of Leeds
Quoting Diana Santos <Diana.Santos at sintef.no>:
> Dear Mark,
> I don't think your question makes much sense -- possibly because you
> fail to explain what is the purpose of your frequency lists. I
> would expect that, depending on the goal, completely different
> answers would be possible and would make sense.
> Apparently, you are concerned with ranking (not the actual frequency
> numbers but the order in the list). Is this right?
> But what would be the purpose (or usefulness) to select every fifth
> occurrence of a word in a corpus? What linguistic function would
> that have? Certainly not a computational function (we are not in a
> time where we have to spare computing power in counting :-)
> I would strongly suggest that if you want to reduce your corpus to a
> fifth, that you still keep utterances that make sense -- that is,
> you should keep every fifth sentence of your corpus, not word.
> Also, the notion of word is quite fluid -- to say the least. So if
> you are working with lemmata (?), is "in spite of" a word, or three?
> Is "Mark Davies" a word, or two? I suppose you first lemmatize your
> corpus, then select... but you may be aware that these kinds of
> decisions have an enormous impact. See Santos et al. (2003) for a
> detailed presentation of differences in tokenization (not even
> lemmatization!) between different groups in Morfolimpíadas (for
> Portuguese), together with quantitative data.
> In any case, depending on the reason why you want the frequency
> lists I would suggest different ways to go/model your problem. Can
> you be more specific?
> These references (for completely different purposes) may also help you:
> Katz, Slava M. 1996. "Distribution of content words and phrases in
> text and language modelling", Natural Language Engineering 2 (1996),
> Berber Sardinha, Tony. 2000. "Comparing corpora with WordSmith
> Tools: How large must the reference corpus be?", in Adam Kilgarriff
> & Tony Berber Sardinha (eds.), Proceedings of The Workshop on
> Comparing Corpora, Held in conjunction with The 38th Annual Meeting
> of the Association for Computational Linguistics, 7 October 2000,
> Hong Kong University of Science and Technology (HKUST), Hong Kong,
> Evert, Stefan. 2006. "How random is a corpus? The library metaphor".
> Zeitschrift für Anglistik und Amerikanistik 54 (2), 177 - 190.
> Santos, Diana, Luís Costa & Paulo Rocha. 2003. "Cooperatively
> evaluating Portuguese morphology". In Nuno J. Mamede, Jorge
> Baptista, Isabel Trancoso & Maria das Graças Volpe Nunes (eds.),
> Computational Processing of the Portuguese Language: 6th
> International Workshop, PROPOR 2003. Faro, Portugal, June 2003
> (PROPOR 2003) 2003, Berlin/Heidelberg : Springer Verlag, pp.
> For Portuguese, through the ACD/DC project, we have very detailed
> frequency lists for 22 different corpora, both for forms, for
> lemmata per PoS, and for lemmata irrespective of PoS. You may want to
> consult those also to get inspiration for your hypotheses:
> www.linguateca.pt/ACDC/ Choose Frequência on the left-hand side menu.
> Hope to have been of some help,
>> -----Original Message-----
>> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no]
>> On Behalf Of Mark Davies
>> Sent: 2. april 2009 00:53
>> To: corpora at uib.no
>> Subject: [Corpora-List] Corpus size and accuracy of frequency listings
>> I'm looking for studies that have considered how corpus size
>> affects the accuracy of word frequency listings.
>> For example, suppose that one uses a 100 million word corpus
>> and a good tagger/lemmatizer to generate a frequency listing
>> of the top 10,000 lemmas in that corpus. If one were to then
>> take just every fifth word or every fiftieth word in the
>> running text of the 100 million word corpus (thus creating a
>> 20 million or a 2 million word corpus), how much would this
>> affect the top 10,000 lemma list? Obviously it's a function
>> of the size of the frequency list as well -- things might not
>> change much in terms of the top 100 lemmas in going from a 20
>> million word to a 100 million word corpus, whereas they would
>> change much more for a 20,000 lemma list. But that's
>> precisely the type of data I'm looking for.
>> Thanks in advance,
>> Mark Davies
>> Mark Davies
>> Professor of (Corpus) Linguistics
>> Brigham Young University
>> (phone) 801-422-9168 / (fax) 801-422-0906
>> Web: davies-linguistics.byu.edu
>> ** Corpus design and use // Linguistic databases **
>> ** Historical linguistics // Language variation **
>> ** English, Spanish, and Portuguese **
>> Corpora mailing list
>> Corpora at uib.no