[Corpora-List] estimates of written/spoken input: summary

Marco Baroni baroni at sslmit.unibo.it
Fri Dec 9 10:50:01 CET 2005


Dear all,

Two weeks ago I asked if somebody knew of work reporting estimates of how
many words/sentences/etc. (adult) speakers of a language hear/read.

I paste below the responses I got.

Thanks a lot to all who responded!

Regards,

Marco


******************************************
Reinhard Rapp
******************************************

Dear Marco,

I am also interested in the answer to your question. Some discussion
can be found in a Psychological Review paper by Landauer & Dumais
(1997) which is on the web at

http://lsa.colorado.edu/papers/plato/plato.annote.html

This is a citation from the most relevant part, which is footnote 6:

----------- start citation ------------


From his log-normal model of word frequency distribution and the
observations in Carroll et al. (1971), Carroll estimated a total vocabulary
of 609,000 words in the universe of text to which students through high
school might be exposed.
Dahl (1979), whose distribution function agrees with a different but
smaller sample of Howes (1966), found 17,871 word types in 1,058,888 tokens
of spoken American English, compared to 50,406 in the comparable sized
adult sample of Kucera & Francis (1967). By Carroll's (1971) model, Dahl's
data imply a total of roughly 150,000 word types in spoken English, thus
approximately one-fourth the total, less to the extent that there are
spoken words that do not appear in print. Moreover, the ratio of spoken to
printed words to which a particular individual is exposed must be even more
lopsided because local, ethnic and family usage undoubtedly restrict the
variety of vocabulary more than published works intended for the general
school-aged readership.
If we assume that our seventh-grader has met a total of 50 million word
tokens of spoken English (140 minutes a day at 100 words per minute for 10
years) then the expected number of occasions on which she would have
heard a spoken word of mean frequency would be about 370. Carroll's
estimate for the total vocabulary of seventh grade texts is 280,000, and we
estimate below that the typical student would have read about 3.8 million
words of print. Thus, the mean number of times she would have seen a
printed word to which she might be exposed is only about 14. The rest of
the frequency distributions for heard and seen words, while not
proportional, would, at every point, show that spoken words have already
had much greater opportunity to be learned than printed words, so will
profit much less from an additional occurrence.

----------- end citation ------------
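
The arithmetic in the footnote can be checked with a short Python sketch.
All figures come from the quoted passage; the variable and function names
are illustrative, and counting a year as 365 days is an assumption. Note
that dividing 50 million tokens by the 150,000 spoken types quoted above
gives about 333 rather than "about 370"; the footnote does not say which
type count it used, so the sketch only reports the simple ratio.

```python
# Figures quoted in footnote 6 of Landauer & Dumais (1997).
MINUTES_PER_DAY = 140        # listening time per day
WORDS_PER_MINUTE = 100       # speech rate
YEARS = 10
DAYS_PER_YEAR = 365          # assumption: not stated in the footnote

SPOKEN_TOKENS_ROUNDED = 50_000_000   # the footnote's rounded total
SPOKEN_TYPES = 150_000               # Dahl's data under Carroll's model
PRINTED_TOKENS = 3_800_000           # words of print read by the student
PRINTED_TYPES = 280_000              # vocabulary of seventh-grade texts


def tokens_heard(minutes_per_day, words_per_minute, years, days_per_year):
    """Total word tokens heard over the given period."""
    return minutes_per_day * words_per_minute * days_per_year * years


def mean_exposures(tokens, types):
    """Average number of times each word type has been encountered."""
    return tokens / types


spoken_tokens = tokens_heard(MINUTES_PER_DAY, WORDS_PER_MINUTE,
                             YEARS, DAYS_PER_YEAR)
spoken_mean = mean_exposures(SPOKEN_TOKENS_ROUNDED, SPOKEN_TYPES)
printed_mean = mean_exposures(PRINTED_TOKENS, PRINTED_TYPES)

print(f"spoken tokens over 10 years: {spoken_tokens:,}")
print(f"mean exposures per spoken type: {spoken_mean:.0f}")
print(f"mean exposures per printed type: {printed_mean:.0f}")
```

The unrounded token total comes out slightly above the footnote's 50
million, and the printed-word ratio rounds to the 14 given in the text.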

...

With kind regards,

Reinhard



******************************************
Paula Newman
******************************************

Marco,
That's an interesting question. A little googling suggested that a lower
bound might come from data on the average number of hours of TV watching
per adult (multiplied by average words per minute on TV broadcasts).
Paula
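
The lower bound suggested above is a simple product, sketched here in
Python. The function name and the example values are hypothetical
placeholders, not measured viewing or speech-rate data; real figures would
have to come from a viewing survey and from broadcast transcripts.

```python
def tv_lower_bound(hours_per_day, words_per_minute, days=365):
    """Rough lower bound on word tokens heard from TV over `days` days."""
    return hours_per_day * 60 * words_per_minute * days


# Placeholder inputs for illustration only (NOT real statistics).
example = tv_lower_bound(hours_per_day=2.0, words_per_minute=150)
print(f"tokens per year under the placeholder inputs: {example:,.0f}")
```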



******************************************
Paul Bennett
******************************************


Geoffrey Pullum and Barbara Scholz (in The Linguistic Review 19, 2002, p. 44) cite
evidence that by the age of three a child in a professional household might
have heard 30 million word tokens (but far fewer for children in other social
classes). I know this relates to children rather than adults, but presumably
the amount of language heard does not differ much by age.

Their source is B. Hart and T. Risley: Meaningful Differences in the Everyday
Experiences of Young Children (Paul H Brookes, 1995). I haven't read this, but
I guess this would be a place to look for more information.

Paul Bennett



******************************************
Ilana Bromberg
******************************************


Marco,

There is some information regarding how much school-age children (up
through HS I think) read in the following article. It's possible that some
of the sources they cite may have more information about adults.

Landauer, Thomas K and Dumais, Susan T. 1997. A Solution to Plato's
Problem: The Latent Semantic Analysis Theory of the Acquisition, Induction,
and Representation of Knowledge. Psychological Review, 104:2, 211-240.

Good luck,
Ilana



--
Marco Baroni
SSLMIT, University of Bologna
http://sslmit.unibo.it/~baroni





