I have some data from the Birmingham Collection of English Text (18m words; c. 1986) and the Bank of English corpus (418m words; c. 2000) which may be relevant to your question.
Unfortunately the comparison is very inexact. The two corpora were compiled 14 years apart, using different design policies, data-collection strategies and procedures, and different technologies; they differ substantially in composition; and the frequencies were based on different tokenization principles, and so on.
Also, I do not have lemmatized frequencies to offer, only type frequencies. And I only have the examples given below, and cannot generate any new lists.
However, the fact that there were (albeit small) changes in rank even in the top 10 items of the type frequency lists suggests that effects of corpus size on lemmas lower down the lists could be substantial:
WORD           18m          418m
the          1,081,654   22,849,031
of             535,391   10,551,630
and            511,333    9,787,093
to             479,191   10,429,009
a              419,798    9,279,905
in             334,183    7,518,069
that           215,332    4,175,495
s                    -    4,072,762
is                   -    3,900,784
it             198,578    3,771,509
for                  -    3,690,466
i              197,055    3,216,005
was            194,286    3,092,967
(- = 18m frequency not available in my data)
An inspection of some random types at various levels in the lists seems to bear this out. By rank 5,000 in the 18m corpus, we see variations of 5,000+ ranks in the 418m corpus (i.e. from 'prey' downwards):
                    18m                418m
WORD             RANK     FREQ      RANK       FREQ
been               48   48,068        47  1,019,904
people             75   26,057        72    610,679
how                94   20,906       104    393,586
going             129   14,924       147    288,607
away              150   12,168       225    185,260
house             176    9,890       206    198,592
widely          2,500      660     2,486     17,804
prey            5,000      280     9,211      3,185
fulfilment     10,000      107    15,122      1,506
balloon        15,000       58     9,011      3,298
compromises    20,000       37    16,395      1,327
scenic         25,000       26    15,651      1,429
fungal         40,000       11    25,633        628
peyote         70,000        4    58,153        129
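For anyone who wants to repeat this kind of spot check, here is a minimal Python sketch; it assumes each corpus's list is available as a plain 'word<TAB>frequency' file sorted by descending frequency, and the file names below are invented:

    def load_ranked(path):
        """Read a 'word<TAB>freq' list, highest frequency first,
        and return {word: (rank, freq)}."""
        ranked = {}
        with open(path, encoding="utf-8") as f:
            for rank, line in enumerate(f, start=1):
                word, freq = line.rstrip("\n").split("\t")
                ranked[word] = (rank, int(freq.replace(",", "")))
        return ranked

    small = load_ranked("bcet_18m.tsv")   # hypothetical file names
    large = load_ranked("boe_418m.tsv")

    for word in ["been", "people", "prey", "peyote"]:
        if word in small and word in large:
            (r1, f1), (r2, f2) = small[word], large[word]
            print(f"{word:12s} 18m rank {r1:>6,} freq {f1:>9,}   "
                  f"418m rank {r2:>6,} freq {f2:>10,}   shift {r2 - r1:+,}")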
I do not know what would happen if you (for example) extracted a subset of complete texts from a 100m corpus to form a 10m corpus or a 1m corpus. But perhaps this exercise has in effect been conducted already with the BNC, when they produced the Sampler, World Edition, etc.? This would at least reduce many of the differences between BCET and BoE that I mentioned earlier. And perhaps the relevant lemma lists already exist?
Your proposal of selecting every 10th running word from the texts in a 100m corpus to create a '10m corpus' would imply an approximately even distribution of types across the 100m corpus?
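For concreteness, here is a minimal Python sketch of the procedure as I understand it; tokenize() is just a stand-in for whatever tokenizer would actually be used:

    from collections import Counter
    from itertools import islice

    def every_nth(tokens, n=10, offset=0):
        """Yield every nth token of running text, starting at 'offset'."""
        return islice(tokens, offset, None, n)

    def type_frequencies(tokens):
        """Type frequency list (a Counter) from a token stream."""
        return Counter(tokens)

    # e.g., a '10m corpus' drawn from a 100m-token stream:
    #   sample_freqs = type_frequencies(every_nth(tokenize(corpus_text), 10))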
You mention multiword items in your email, but wouldn't your proposed procedure deny any generic or systemic effect of the collocational and phraseological tendencies of language on the frequency of individual types (which would be further affected by lemmatization)?
Also, wouldn't it affect different types/lemmas differently? For example, the high frequency of the content word/type 'time' in any general corpus of English must be greatly affected by its occurrence in many common phrases, whereas the content word/type 'people' (usually also of similarly high frequency) might participate less in phrases, be used more in isolated contexts, and hence be less affected?
Creating lemmatized frequency lists from a 10m corpus built in this way would imply that the members of each lemma were also distributed roughly evenly across the 100m corpus?
I have of course until now by-passed a major linguistic issue: which definition of lemma you are using, and how that affects any lemmatized frequency lists produced.
Although I feel neither mathematically nor linguistically competent to say much more without further evidence and discussion, wouldn't it be relatively straightforward (computationally) to implement your proposal on existing corpora? I would certainly be very interested to know the results!
Best,
Ramesh
Ramesh Krishnamurthy
Lecturer in English Studies, School of Languages and Social Sciences,
Aston University, Birmingham B4 7ET, UK
Tel: +44 (0)121-204-3812; Fax: +44 (0)121-204-3766
[Room NX08, 10th Floor, North Wing of Main Building]
http://www1.aston.ac.uk/lss/staff/krishnamurthyr/
Director, ACORN (Aston Corpus Network project): http://acorn.aston.ac.uk/

Date: Fri, 3 Apr 2009 08:45:35 -0600
From: Mark Davies <Mark_Davies at byu.edu>
Subject: Re: [Corpora-List] Corpus size and accuracy of frequency listings
To: "corpora at uib.no" <corpora at uib.no>
> Dear Mark,
> I don't think your question makes much sense -- possibly because you fail to explain what the purpose of your frequency lists is.
No, I didn't give all of the relevant details in the first message. The main issue is what an "adequate" corpus size is to create a lemma list of X number of words in a given language. If it's a top-10,000 lemma list, is 10,000,000 words adequate? Is 100,000,000 much better? The main point -- is it worth the effort to create a corpus ten times the size for only a small increase in accuracy? And I'm not just asking for the sake of curiosity -- there's an upcoming project that needs some data on this.
>> The effect of picking every 5th or 50th running word on the ranked list...
It would be every 5th or 50th word of running text *in the corpus*, *not* the ranked list. In this way, even words that occur mainly in multiword expressions should be fine. Adjacent words X1 and X2 would each be counted, as would any other word. Sometimes the first word would be retrieved as we take words 1, 11, 21, 31... etc., and sometimes it would be the second word. It would never take the whole multiword expression together, of course, but then we're just after 1-grams for the lemma list (unless we *want* to preserve multiword units in the list, as in earlier versions of the BNC, for example).
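Here's a toy simulation of that point (the token stream is entirely invented): each half of a fixed bigram comes back at roughly 1/10 of its true frequency under a 1-in-10 systematic sample:

    import random
    from collections import Counter

    random.seed(1)
    # Fake token stream in which 'of course' always occurs as a pair.
    stream = []
    for _ in range(100_000):
        if random.random() < 0.05:
            stream += ["of", "course"]
        else:
            stream.append(random.choice(["the", "a", "time", "people"]))

    offset = random.randrange(10)   # start at word 1, 2, ... or 10
    sample = stream[offset::10]     # every 10th running word

    full, samp = Counter(stream), Counter(sample)
    for w in ("of", "course"):
        print(w, full[w], samp[w], round(samp[w] / full[w], 3))  # ratios ~0.1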
And again, I'm not proposing to actually reduce a 100 million word corpus down to a 10 million word corpus -- that wouldn't make any sense. The point is whether -- for a ranked lemma list of size X -- a 10 million word corpus, for example, might be nearly as adequate as a 100 million word corpus (all other things -- genres, etc -- being equal).
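One simple way to quantify "nearly as adequate" might be the overlap between the two top-X lemma lists -- a sketch, assuming freqs_10m and freqs_100m are {lemma: count} dicts:

    def top_x(freqs, x):
        """The x most frequent lemmas from a {lemma: count} dict."""
        return [w for w, _ in sorted(freqs.items(), key=lambda kv: -kv[1])[:x]]

    def overlap(freqs_a, freqs_b, x):
        """Proportion of the two top-x lists that the corpora share."""
        return len(set(top_x(freqs_a, x)) & set(top_x(freqs_b, x))) / x

    # e.g., overlap(freqs_10m, freqs_100m, 10_000) -> how much of the
    # 100m corpus's top 10,000 the 10m corpus recovers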
Mark D.
============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
http://davies-linguistics.byu.edu
** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================