[Corpora-List] Frequency lists (corrected)

Adriano Ferraresi adriano at sslmit.unibo.it
Mon Feb 23 11:52:12 CET 2009

Dear Chris,

as regards your question (i), you can find several frequency lists (for English, but also Italian and German) on this site:


The English lists were extracted from ukWaC, a very large web-derived corpus containing around 2 billion words, and are available for unigrams and bigrams. For further details please refer to the site, or have a look at: Baroni, Bernardini, Ferraresi, Zanchetta (in press). "The WaCky Wide Web: a collection of very large linguistically processed web-crawled corpora". Language Resources and Evaluation.
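In case it helps to see what unigram and bigram lists amount to, here is a minimal sketch in Python. It is only an illustration and not the procedure used to build the ukWaC lists: the function name `frequency_lists` and the sample sentence are made up for the example, and it assumes the text has already been tokenized into a list of word forms.

```python
from collections import Counter


def frequency_lists(tokens):
    """Count unigrams and bigrams in a list of tokens (hypothetical helper)."""
    # Unigram counts: how often each token occurs.
    unigrams = Counter(tokens)
    # Bigram counts: how often each pair of adjacent tokens occurs.
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


# Toy input; a real corpus would be tokenized and read from file.
sample = "the cat sat on the mat and the cat slept".split()
unigrams, bigrams = frequency_lists(sample)

# Print the three most frequent word forms with their counts.
for word, count in unigrams.most_common(3):
    print(word, count)
```

On a real corpus one would also have to decide on case folding, punctuation handling and lemmatization before counting, since these choices change the resulting ranks.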



On 23-Feb-09, at 10:50 AM, CRuehlemann at aol.com wrote:

> Dear All
> I'm interested in two questions related to word frequency lists:
> (i) The published frequency lists I am aware of include the
> following few:
> BNC-based:
> Kilgarriff, A. (1998). ‘BNC database and word frequency lists.’ http://www.kilgarriff.co.uk/bnc-readme.html
> Leech, G., P. Rayson and A. Wilson. (2001). Word Frequencies in
> Written and Spoken English: Based on the British National Corpus.
> London: Longman
> CIC-based:
> McCarthy, M. J. (1998). Spoken Language and Applied Linguistics.
> Cambridge: Cambridge University Press
> Could anybody point me to more word frequency lists available either
> in print or on the internet?
> (ii) As far as I know, the definite article "the" tops most word
> frequency lists derived from general corpora. Is anybody aware of
> any published in-depth discussion of this finding in terms of
> reference, be it anaphoric, cataphoric or deictic?
> Any help is greatly appreciated. A summary will be posted.
> Chris
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
