[Corpora-List] Query on the use of Google for corpus research

TadPiotr tadpiotr at plusnet.pl
Fri May 27 21:39:00 CEST 2005

That is very interesting, and, well, sad: it does not take too much of one's
time to read the work of people who are supposed to be specialists on
language(s), but people from one camp do not seem to care for the other camp.
The paper by the two scholars is available at
http://www.arxiv.org/abs/cs.CL/0412098; they say modestly in the abstract
that "The approach is novel in its unrestricted problem domain, simplicity
of implementation, and manifestly ontological underpinnings." The last
statement is all the more interesting given that one of the authors, Rudi
Cilibrasi, says in a reply on a blog: "There have been several philosophers
interested in this research. Somebody else mentioned Derrida too. I must
confess my philosophical background is weak and I just now tried to look up
Derrida on the Wikipedia."
There is something deeply wrong...
I had a look at the references in the paper, and you cannot say they did not
refer to linguistic literature (or what they take to be linguistic
literature): there is a reference to a note on corpora in The Economist.
Did I say something seems to be wrong?
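(For list members who have not looked at the paper: its central measure, the
Normalized Google Distance, needs nothing beyond search-engine hit counts,
which is exactly why the reliability of those counts matters. A minimal
Python sketch; the counts and index size below are made up, not real Google
figures.)

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from raw hit counts.

    fx, fy -- pages containing term x (resp. y) alone
    fxy    -- pages containing both terms
    n      -- total pages indexed (itself a moving, disputed figure)
    """
    lfx, lfy, lfxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lfx, lfy) - lfxy) / (math.log(n) - min(lfx, lfy))

# Made-up counts: two terms that co-occur often give a small distance.
print(ngd(fx=9_000_000, fy=8_000_000, fxy=4_000_000, n=8_000_000_000))
```

Every one of the four inputs comes straight from a search engine, so any
instability in the counts feeds directly into the "distance".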
Best wishes,
Tadeusz Piotrowski

> -----Original Message-----
> From: owner-corpora at lists.uib.no
> [mailto:owner-corpora at lists.uib.no] On Behalf Of Dominic Widdows
> Sent: Friday, May 27, 2005 3:46 PM
> To: Jean.Veronis at up.univ-mrs.fr
> Cc: corpora at uib.no; Peter K Tan; ellmml at nus.edu.sg
> Subject: Re: [Corpora-List] Query on the use of Google for corpus research


> >> Does anyone have any experience/insight on this?
> >
> > Well... yes! I made a series of in-depth analyses of Google counts.
> > They are totally bogus, and unusable for any kind of serious research.
> > There is a summary here:
> > http://aixtal.blogspot.com/2005/02/web-googles-missing-pages-mystery.html


> Dear All,


> While I agree with the points made in Jean's excellent summary, I
> think it's fair to point out that it was partly motivated by the way
> researchers had been using "Google counts" more and more, and running
> into more and more problems. As a community of researchers and
> peer-reviewers, I still don't think we have been able to agree on best
> practices. I have come across reviews on both sides of the fence,
> saying on the one hand:


> 1. Your method didn't get a very big yield on your fixed corpus; why
> didn't you use the Web?
>
> or, on the other:
>
> 2. Your use of web search engines to get results is unreliable; you
> should have used a fixed corpus.


> The main problem is that "using the Web" on a large scale puts you at
> the mercy of the commercial search engines, which leads to the grim
> mess that Jean documents, especially with Google. This situation may
> hopefully change as WebCorp (http://www.webcorp.org.uk/) teams up with
> a dedicated search engine. In the meantime, it's clearly true that you
> can get more results from the web, but you can't vouch for them
> properly, and so a community that values both recall and precision is
> left reeling.


> At the same time, the fact that you can use search engines to get a
> rough count of language use in many cases has thrown the door open to
> a lot of researchers who have every reason to be interested in
> language as a form of data, but have never tried doing much language
> processing before. Over the decades, linguists have often been very
> sniffy about researchers from other disciplines muscling in on their
> turf, but this often results in articles that talk about language
> simply getting published elsewhere (e.g. in more mainstream media),
> where the reviewers are perhaps more favourable. A recent and typical
> example may be the "Google Distance" hype
> (http://www.newscientist.com/article.ns?id=dn6924) - we've had
> conceptual distance, latent semantic analysis, mutual information,
> etc. for decades; a couple of mathematicians come along and call
> something the "Google distance", and the New Scientist magazine
> concludes that the magic of Google has made machines more intelligent.


> All right, there's a trace of bitterness here - I wouldn't mind being
> in New Scientist for computing semantic distances - but there's a more
> serious danger as well: we've been doing a lot of pretty good work for
> a long while in different areas of corpus and computational
> linguistics, and it would be a shame if other folks went off and
> reinvented everything, just because there are more widely available
> tools that enable a wider community to "give it a go" and come up with
> something that may do pretty well, especially if you're going for
> recall. It breaks some fundamental principles, such as "do your
> experiments in a way that others can replicate them", but this is
> naturally on the increase as big-dataset empiricism comes to the
> forefront of many scientific problems. For example, there's the recent
> research in ocean temperatures that made 7 million temperature
> readings at different stations; none of us can go and replicate that
> data collection, but it doesn't invalidate the research.


> If we just tell people that search-engine-based research is bogus,
> people will just keep doing it and publishing it elsewhere, and who
> knows, in 10 years' time someone using Google or Yahoo counts may
> invent part-of-speech tagging, and that will be another amazing thing
> that makes computers more intelligent.


> Sorry, I haven't got any answers, but I'm writing this in the hope
> that someone else on the list has!
> Best wishes,
> Dominic


