[Corpora-List] Re: problems with Google counts

Stefan Evert evert at IMS.Uni-Stuttgart.DE
Thu Mar 17 10:17:08 CET 2005



> > (I had deleted all those messages after reading them). I tried this
> > yesterday on a couple of Spanish words (eficaz, eficiente). (By the
> > way, the results were apparently consonant with a student's search of
> > the 100,000,000 word corpusdelespañol site.) Anyway, what repeating
> > the word apparently does is limit the results to those sites which
> > have the word at least two times, in this case cutting down on the
> > numbers by roughly 10%.
>
> Actually that's not the case. When you repeat the word, Google ranks
> first pages that contain the multiword expression you type. For example,
> if you type A B C, you'll see first pages that contain "A B C" exactly,
> if any. In the case of A A, you will see pages that contain exactly "A
> A" first, but pages where A appears only once appear later on.


Well, that can't quite be the case either, at least not today. Things
get really funny (in its "weird" sense, I'm afraid) when you start
looking for more than two repetitions. These are the numbers I got
from Google five minutes ago:

3,560,000,000 the
3,600,000,000 the the
2,800,000,000 the the the
2,830,000,000 the the the the
2,820,000,000 the the the the the
etc.

When you look for non-stop-words, Google seems to make a distinction
between one occurrence and two or more occurrences:

3,110,000 fink
1,970,000 fink fink
1,970,000 fink fink fink
etc.

It would seem that in response to Jean's post, Google has changed
something to enforce consistent results (unless this is just a
side-effect of a new search engine that doesn't support wildcards).

If you go to the German Google site (www.google.de), for instance, you
will still find the old search engine in place (funny that google.de
seems to find more English pages than google.com ...):

8,000,000,000 the
88,100,000 the the
87,500,000 the the the
86,700,000 the the the the
etc.

At least we still have the wildcard "*" for an arbitrary word. For
non-stop-words, the results are consistently inconsistent:

3,460,000 fink
1,900,000 fink fink
1,920,000 fink fink fink
1,870,000 fink fink fink fink
1,910,000 fink fink fink fink fink

I am quite convinced that there is no sensible interpretation of these
queries for which the Google numbers are even remotely plausible.
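
One way to make this concrete is a small sanity check, sketched here in
Python. It rests on an assumption that is not stated in so many words
above: whether a repeated term is ignored, read as "at least n
occurrences", or read as an exact phrase, the count for n+1 repetitions
should never exceed the count for n. The numbers are copied from the
lists in this message; the names `counts` and `violations` are made up
for illustration.

```python
# Hit counts copied from the figures quoted in this message.
# Index i of each list holds the count for i + 1 repetitions of the word.
counts = {
    ("google.com", "the"): [3_560_000_000, 3_600_000_000, 2_800_000_000,
                            2_830_000_000, 2_820_000_000],
    ("google.com", "fink"): [3_110_000, 1_970_000, 1_970_000],
    ("google.de", "the"): [8_000_000_000, 88_100_000, 87_500_000, 86_700_000],
    ("google.de", "fink"): [3_460_000, 1_900_000, 1_920_000,
                            1_870_000, 1_910_000],
}


def violations(series):
    """Return the (n, n + 1) repetition pairs where the count goes up.

    Assumption: under any sensible reading of a repeated-term query,
    the count should be non-increasing in the number of repetitions.
    """
    return [(i + 1, i + 2)
            for i in range(len(series) - 1)
            if series[i + 1] > series[i]]


for (site, word), series in counts.items():
    print(site, word, violations(series))
```

A non-increasing series only passes this one necessary condition; it
says nothing about whether the size of a drop (for example from "the"
to "the the" on google.de) is plausible. Google's counts are also
rounded, so very small increases could in principle be rounding noise.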

Stefan.
http://wacky.sslmit.unibo.it/


--
I'm not a nerd. I'm a specialist.
-- from Full Metal Panic, Episode 8
______________________________________________________________________
Stefan Evert purl.org/stefan.evert
http://www.collocations.de/ stefan.evert at uos.de




