Words is fine.
> Then, we can refer to Google basing Web1T on 10^12 words of English. Of
> course that is only what Google finds, not what is there, and it is only
> English. But they will have taken tasks like distinguishing text from
> non-text, and deduplication, seriously, which must be a good thing if the
> question is asked from a linguistic or NLP perspective.
According to http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html the Google corpus was indeed based on processing 1 (US) trillion words (10^12 words). However, there is no indication that this represents all the textual data that Google has indexed, and I doubt that it does.
I was invited to edit a special issue of IEEE Intelligent Systems on "NLP using and for the Web" (title to be finalized), and I realized that we (or at least I) don't even know accurately how much text is on the Web. Adam, you were one of the earliest proponents of using the Web as a corpus. Do you know what the largest corpus study ever done in NLP is, in terms of the size of the underlying data set?
> While the Berkeley reference is clearly a key one, I was surprised simply at
> the extent to which it showed up more questions than answers. If that's the
> best guess (at least in 2003) at how much is out there, our collective
> of ignorance really is stunning. (Though I can't help thinking that the
> guys - Google, Yahoo, Microsoft, IBM - will have better answers that they
> don't publish)
> -----Original Message-----
> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
> radev at umich.edu
> Sent: 13 November 2007 17:37
> To: Constantin Orasan
> Cc: corpora at hd.uib.no
> Subject: Re: [Corpora-List] amount of text on the web?
> This is too old. I have seen this one and quoted it a lot.
> > Hi,
> > The numbers are a bit old but a very good study which investigates how
> > much data is on the web is:
> > Lyman, Peter and Hal R. Varian (2003) How much information? 2003.
> > Technical report, School of Information Management and Systems,
> > University of California at Berkeley.
> > http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/
> > Regards
> > Constantin
> > > I am looking for some up to date statistics on the amount of textual
> > > data on the web. I have seen varied estimates ranging up to 1
> > > Exabyte. I am sure that it is not possible to define precisely what
> > > "text on the web" means (do you include email, cached text, local
> > > files, "hidden" web, etc).
> > >
> > > Drago
> > --
> > Constantin Orasan <C.Orasan at wlv.ac.uk>
> > Lecturer in Computational Linguistics
> > Research Group in Computational Linguistics
> > http://www.wlv.ac.uk/~in6093/
> > University of Wolverhampton
> Dragomir R. Radev Associate Professor
> SI, CSE, Ling U. Michigan, Ann Arbor
> http://www.eecs.umich.edu/~radev radev at umich.edu
--
Dragomir R. Radev Associate Professor
SI, CSE, Ling U. Michigan, Ann Arbor
http://www.eecs.umich.edu/~radev radev at umich.edu