[Corpora-List] problems with Google counts

Nancy Ide ide at cs.vassar.edu
Wed Mar 16 19:51:05 CET 2005


Are people aware of the Linguist's Search Engine, developed at the
University of Maryland for doing linguistic searches on internet data?
The URL is http://lse.umiacs.umd.edu

On Mar 16, 2005, at 1:26 PM, Ring Low wrote:


> A few years ago I did a study of the uses of the definite article THE
> in English using Google search (the data was collected in 2003). I
> used an Internet search engine to conduct the study partly because I
> wanted to get page counts, which exclude repeated instances in
> the same text (i.e., rather than absolute frequencies).
> I gathered about 1500 nouns and put them into the search engine using
> two strings, "the * N" and "the N". I also did the same for other
> pre-nominal elements such as "a", "this", "that", "my", "his", "her".
> Other criteria I used at that time were "in text only" and "English
> only".
>

> The inconsistency I found, at that time, was that the sum of the
> frequencies I obtained for all the nouns with one element was always
> much larger than the frequency reported in a single search for that
> element, i.e., the sum of all "the N" counts was much larger than the
> count for the word "the" alone in the Google database, which did
> puzzle me.
>

> On the other hand, I did find some consistencies in the data. First,
> the ratios of the frequencies across searches were always about the
> same, even though I ran all the searches a couple of times over
> several months. In addition, the relative frequencies among the nouns
> at that time, as far as I could check, were consistent with the data I
> found in some other corpora (e.g., if one finds that a word is
> of relatively high frequency in Google, one would also find that
> word having a relatively high frequency in other texts).

> I agree that using Google to conduct linguistic studies has gotten
> more and more difficult since then, as the design of the search engine
> has been changing for commercial reasons. We do need a search
> engine designed specifically for linguistic studies. On the other hand,
> until such a search engine is available, some other ways to avoid
> problematic results might be to adjust the design of the study
> according to the known weaknesses of the engine and to cross-check
> the results manually with traditional corpora and other search
> engines.

> --
> ==============================
> Ring Low
> mlow at acsu.buffalo.edu
> http://www.acsu.buffalo.edu/~mlow/
> ==============================
>

> Lillian Lee wrote:
>
>> Dear list members,
>>
>> You might be interested to know that until approximately March 8th,
>> Google counts appear to have been quite off (inflated by about
>> 66%?), according to Jean Veronis.
>>

>> In a blog post of February 8th
>> ( http://aixtal.blogspot.com/2005/02/web-googles-missing-pages-mystery.html ),
>> Veronis summarized his earlier findings:
>>

>> # If you type Chirac OR Sarkozy, you get half the number of results
>> you get for Chirac alone, which may have a political explanation...
>> but is a weird approach to boolean logic.
>>
>> # If you search for "the" in the English pages, you get 1% of the
>> number you get for all languages together. Does this mean that "the"
>> is 99 times more frequent in languages other than English? Of course
>> not.
>>

>> He gave a possible explanation and noted that "if you want to know the
>> real index count for any word, simply type it twice".
>>
>> On March 13th, he noted that the counts seem to have been adjusted,
>> that is "changed in a major way":
>> http://aixtal.blogspot.com/2005/03/web-google-adjusts-its-counts.html
>>
>> Related posts indicate problems with MSN, the possibility that Yahoo
>> indexes more pages than Google, and more details on calculations.
>> ________________________________________________________________
>> Lillian Lee, Assoc. Prof. tel: 607-255-8119
>> Dept of Computer Science fax: 607-255-4428 Cornell University
>> llee at cs.cornell.edu Ithaca, NY 14853-7501 USA
>> www.cs.cornell.edu/home/llee
>> ________________________________________________________________


=======================================================

Nancy Ide

Professor of Computer Science
Vassar College
Poughkeepsie, NY 12604-0520 USA
Tel: +1 845 437-5988 Fax: +1 845 437-7498
ide at cs.vassar.edu

Chercheur Associe
Equipe Langue et Dialogue, LORIA/CNRS
Campus Scientifique - BP 239
54506 Vandoeuvre-les-Nancy FRANCE
Tel: +33 (0)3 83 59 20 47 Fax: +33 (0)3 83 41 30 79
ide at loria.fr

=======================================================





