[Corpora-List] Re: problems with Google

Marian Olteanu mou_softwin at yahoo.com
Sun Mar 27 00:19:01 CET 2005


If you use counts from Google or any other search engine in your linguistic projects, the best
approach is to collect all the data you need within a short time frame, because the counts six
months from now cannot be expected to match today's counts. In the best case, all the counts
would simply be scaled by a common factor (in which case you can automatically adjust the new
counts to match the old ones).
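
As a rough illustration of that adjustment, here is a minimal Python sketch. It assumes the
change really is close to a uniform scaling, and it estimates the factor as the median of the
old/new ratios over queries present in both snapshots. The median is my choice for the sketch,
not something prescribed above, and the function names and the counts are made up.

```python
from statistics import median


def estimate_scale(old_counts, new_counts):
    """Estimate one factor that maps new counts back onto the old scale.

    Uses the median of old/new ratios over queries that have a non-zero
    count in both snapshots (an assumption made for this sketch).
    """
    ratios = [
        old_counts[q] / new_counts[q]
        for q in old_counts
        if old_counts[q] > 0 and new_counts.get(q, 0) > 0
    ]
    if not ratios:
        raise ValueError("no shared non-zero queries to calibrate on")
    return median(ratios)


def rescale(new_counts, scale):
    """Multiply every new count by the estimated factor."""
    return {q: c * scale for q, c in new_counts.items()}


# Made-up counts, for illustration only.
old = {"the cat": 1000, "the dog": 2000, "the fish": 500}
new = {"the cat": 2000, "the dog": 4000, "the fish": 1000, "the bird": 300}

scale = estimate_scale(old, new)
adjusted = rescale(new, scale)
```

If the ratios vary widely across the shared queries, the counts did not simply scale, and a
single factor will not make the two snapshots comparable.
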
Moreover, your analysis of the differences between Google servers suggests that when you collect
the data, you should collect it from a single server. So you should not request pages from
www.google.com, but from 66.102.7.104 or any other specific Google server, even when Google is
not Saint-Vitus dancing.
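
To make the single-server point concrete, here is a small Python sketch that builds every query
URL against one fixed host instead of www.google.com. The parameter names are copied from the
search URL quoted below; the function name and the constant are hypothetical, and the sketch
only builds the URL without sending any request.

```python
from urllib.parse import urlencode

# One fixed data center address, so every count comes from the same server.
FIXED_HOST = "66.102.7.104"


def build_search_url(host, query):
    """Build a search URL for a given host (parameters taken from the quoted URL)."""
    params = [
        ("hl", "en"),
        ("lr", ""),
        ("c2coff", "1"),
        ("q", query),
        ("btnG", "Search"),
    ]
    # safe="*" keeps the wildcard operator unescaped, as in the quoted URL.
    return "http://" + host + "/search?" + urlencode(params, safe="*")


url = build_search_url(FIXED_HOST, '"the * cat"')
```
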

--- Jean Veronis <Jean.Veronis at up.univ-mrs.fr> wrote:

> Marian Olteanu wrote:
>
> >It looks like Google resumed the support for the wildcard ("*").
>
> It goes back and forth. I posted this yesterday, but it seems that it
> did not go through. I'll try again :
>
> ---
>
> For those still interested in "Google linguistics" : Google is still
> unstable, and its various data centers are divided in three different
> groups with very different results. Clearly the update is a difficult one !
>
> If you want to see a snapshot, I made one this morning :
>
> http://aixtal.blogspot.com/2005/03/google-snapshot-of-update.html
>
> The good news is that in the DC group that seems to be updated (e.g.
> 66.102.7.104), the "*" operator seems to work as it used to.
>
> Try this search:
>
> http://66.102.7.104/search?hl=en&lr=&c2coff=1&q=%22the+*+cat%22&btnG=Search
>
> But who knows what will be the stable state if there is one, after this
> Saint-Vitus dance ?
>
> --j
> http://aixtal.blogspot.com


Marian
http://www.utdallas.edu/~mgo031000/







