[Corpora-List] Frequency lists (corrected)

Stefan Evert stefan.evert at uos.de
Mon Feb 23 21:29:11 CET 2009



> There is, of course, the Google language modeling data, based on over
> a trillion words' worth of web pages:
>
> http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

In that context, I can't resist pointing out my signature ...

-- The wonders of Googleology (episode 1)

"from collectibles to cars"

84,700,000 -- Google

9,443,672 -- Google N-grams (Web 1T5)

1 -- ukWaC

[ stefan.evert at uos.de | http://purl.org/stefan.evert ]


