[Corpora-List] Re: problems with Google counts

Jean Veronis Jean.Veronis at up.univ-mrs.fr
Thu Mar 17 08:49:06 CET 2005


FIDELHOLTZ_DOOCHIN_JAMES_LAWRENCE wrote:


> Hi, Corpora Guys,
> Sorry I don't remember who wrote suggesting simply repeating the word
> in Google to get a supposedly more realistic count of pages with the
> word in it


Me ;-)

http://aixtal.blogspot.com/2005/02/web-googles-missing-pages-mystery.html


> (I had deleted all those messages after reading them). I tried this
> yesterday on a couple of Spanish words (eficaz, eficiente). (By the
> way, the results were apparently consonant with a student's search of
> the 100,000,000 word corpusdelespañol site.) Anyway, what repeating
> the word apparently does is limit the results to those sites which
> have the word at least two times, in this case cutting down on the
> numbers by roughly 10%.


Actually, that's not the case. When you repeat the word, Google ranks
first the pages that contain the multiword expression you typed. For
example, if you type A B C, you'll first see pages that contain "A B C"
exactly, if any. In the case of A A, you will first see pages that
contain exactly "A A", but pages where A appears only once appear later
on.


> If that is what is happening, this implies serious problems for
> relatively rare words, which may not occur twice in very many pages at
> all. At any rate, the decrease in pages encountered seemed to be
> about the same proportionately in both cases. (We're talking here
> about roughly 1.5M original hits.) If I'm missing the point of the
> suggestion, please straighten me out.

I think you'll find the whole logic explained in my post cited above.
Google counts were artificially inflated by 66%; since the inflation
applies to all counts alike, proportions stay identical.
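The point about proportions can be checked with a line of arithmetic: if every count is multiplied by the same inflation factor, that factor cancels out of any ratio between two words. A minimal sketch, with entirely made-up counts for eficaz and eficiente:

```python
# Hypothetical "true" page counts (illustrative only, not real data).
true_counts = {"eficaz": 900_000, "eficiente": 600_000}
inflation = 1.66  # the ~66% inflation discussed above

# What a search engine would report if every count were inflated uniformly.
reported = {w: c * inflation for w, c in true_counts.items()}

true_ratio = true_counts["eficaz"] / true_counts["eficiente"]
reported_ratio = reported["eficaz"] / reported["eficiente"]

# The inflation factor cancels: the ratio is unchanged.
assert abs(true_ratio - reported_ratio) < 1e-9
```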

However, if you test Google again these days, you will see MAJOR changes
in the counts. My post made a lot of noise (it was written in early
February). It has been relayed on many forums, etc., and I know that the
Googlers have read it with great care (and other search engine makers as
well ;-). In February they started making major changes to the counts in
order to reduce the inconsistencies I had spotted -- and close the
backdoors they had left open inadvertently.

Just to give an example: when you typed "the" previously, you used to get

* 8 billion for "all the web"
* 80 million for "the" restricted to English pages

i.e. 1%, which doesn't make sense.
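The inconsistency is easy to verify: 80 million out of 8 billion is 1%, an implausibly small share of the web for pages in English containing "the". A quick check with the figures quoted above:

```python
all_web = 8_000_000_000   # reported hits for "the", all of the web
english = 80_000_000      # reported hits for "the", English pages only

# 80 million / 8 billion = 0.01, i.e. 1% of the web
share = english / all_web
print(f"{share:.0%}")
```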

This morning I tried again, and I get 3.6 billion in both cases, which
does make sense. (This may change again if you try: for a week or so,
Google has been totally unstable, due to the major update process.)

I explained these recent changes last week at:

http://aixtal.blogspot.com/2005/03/web-google-adjusts-its-counts.html

Since then, more changes have occurred. Google is trying to get close to
credible figures. I am afraid that it's not the index that has been
fixed, but just the extrapolation formulas. In any case, we will never
know, and that's the problem: you can't do science with instruments you
don't understand and can't trust.

By the way, Yahoo gives very reliable and consistent results (including
for booleans), which I have cross-checked with English and French
corpora. The only problem was its lack of the wildcard operator, but
Google dropped it anyway.

I personally use it quite satisfactorily -- so far:

http://aixtal.blogspot.com/2005/02/lexique-yahoo-et-les-yahoourts.html
http://aixtal.blogspot.com/2005/03/lexique-glissance-et-pntrance.html
(in French and on French, sorry)

And they released a very nice API which enables getting 25 times more
results than Google (5,000 queries a day x 50 results per page, instead
of 1,000 x 10 for Google). However, I hope Yahoo won't start playing
weird marketing games too:
http://aixtal.blogspot.com/2005/03/web-yahoo-doubles-its-counts.html
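The 25x figure is simply the ratio of the two daily result quotas, using the numbers quoted above:

```python
yahoo_daily = 5000 * 50   # 5,000 queries/day x 50 results per page
google_daily = 1000 * 10  # 1,000 queries/day x 10 results per page

# 250,000 / 10,000 = 25 times more results per day
print(yahoo_daily // google_daily)
```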

--j
http://aixtal.blogspot.com


More information about the Corpora-archive mailing list