[Corpora-List] Google "region"-based searches

Mark Davies Mark_Davies at byu.edu
Wed Nov 28 14:50:04 CET 2012


(Sorry if this shows up as a duplicate post. I originally sent it an hour or two ago, but it looks like it never made it through.)


>> By the way, if you want to use only Google results without crawling (BootCaT approach), you will have to pay substantial amounts of money, because they don't allow free API bulk requests anymore.

I've been running high-frequency COCA 3-grams against Google for the last week or so, to create a 2-3 billion word corpus, and I've collected a bit more than 2,000,000 URLs.

Google does ask you to solve a CAPTCHA (identify a distorted word) every 3-4 hours, but as long as that's not a problem (I have my program email me as soon as it gets redirected to that page), it works fairly well.
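In case it's useful to anyone, here is a rough sketch (in Python) of the kind of monitoring I mean. This is just an illustration of the general idea, not my actual code: the "/sorry/" redirect marker, the "unusual traffic" string, the local mail relay, and the addresses are all assumptions you would adapt to your own setup.

# Sketch: fetch a Google result page for one n-gram query; if the response
# looks like the CAPTCHA interstitial, email yourself and stop, so the
# CAPTCHA can be solved by hand. All markers and addresses are assumptions.
import smtplib
import urllib.parse
import urllib.request
from email.mime.text import MIMEText

def looks_like_captcha(final_url, html):
    # Google's block page typically lives under a /sorry/ URL and mentions
    # "unusual traffic"; adjust these markers if they change.
    return "/sorry/" in final_url or "unusual traffic" in html.lower()

def notify(subject, body, sender="me@example.org", recipient="me@example.org"):
    # Assumes a mail relay listening on localhost.
    msg = MIMEText(body)
    msg["Subject"], msg["From"], msg["To"] = subject, sender, recipient
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

def fetch_result_page(query):
    url = "https://www.google.com/search?q=" + urllib.parse.quote(query)
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        html = resp.read().decode("utf-8", errors="replace")
        if looks_like_captcha(resp.geturl(), html):
            notify("Google CAPTCHA hit",
                   "Scraper paused; solve the CAPTCHA and restart.")
            raise RuntimeError("Redirected to the CAPTCHA page")
        return html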

BTW, on BootCaT, I was under the impression that they were having trouble finding a search engine that allowed a sufficient number of queries (see http://listserv.linguistlist.org/cgi-bin/wa?A2=ind1204&L=CORPORA&P=R12047). Has this been solved? It looks like the limit is about 5,000 queries per month (see http://listserv.linguistlist.org/cgi-bin/wa?A2=ind1207&L=CORPORA&P=R455).

Anyway, with a bit of effort it is possible to (at least partially) circumvent the Google limits, to get several million URLs per month -- if that's the route one wants to go.

MD

============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================

________________________________________
From: corpora-bounces at uib.no [corpora-bounces at uib.no] on behalf of Roland Schäfer [roland.schaefer at fu-berlin.de]
Sent: Wednesday, November 28, 2012 4:48 AM
To: corpora at uib.no
Subject: Re: [Corpora-List] Google "region"-based searches

I totally agree with what Adam Kilgarriff said: The problem is that nobody would want to do research on the accuracy of any Google feature, because they can change their algorithms at any moment without notice and without documentation. Results are fundamentally invalid even before you produce them.

By the way, if you want to use only Google results without crawling (BootCaT approach), you will have to pay substantial amounts of money, because they don't allow free API bulk requests anymore.

Whatever Google uses: IP-based geolocation is totally unreliable as far as language varieties are concerned. If you find a document from a server located in Liverpool, are you going to treat the document as necessarily (or even potentially) containing Scouse features? Also, servers deliver different content based on undocumented mixes of the various headers sent by the requester, the requester's IP geolocation, etc. Thus, a server located in London may deliver specialized content for US visitors, potentially written for US visitors by US authors. Google or any geolocator might even have classified the region of origin of some document correctly, while your crawler gets redirected to a different IP address. Automatic methods applied to large amounts of data will most likely never deliver reliable region identification.

If you want to deal with regional varieties in web corpora, I think two possible routes are: (1) go for a small gold-standard web corpus and try to figure out the variety spoken by the writers manually for each document; (2) do more or less unselective crawls of the English-speaking web and then see whether the documents you get look like what you already know to be BrE, AmE, etc. Actually, top-level domains might turn out to be sort of reliable in some cases (with the emphasis on "sort of"); a toy sketch of that idea follows the reference below. Cf., e.g.:

@INPROCEEDINGS{Cook-Hirst2012,
  author    = {Cook, Paul and Hirst, Graeme},
  title     = {Do Web-Corpora from Top-Level Domains Represent National Varieties of {English}?},
  booktitle = {Proceedings of the 11th International Conference on the Statistical Analysis of Textual Data},
  year      = {2012},
  address   = {Liège},
}
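To make route (2) a bit more concrete, here is a trivial sketch (Python) of the TLD heuristic: bucket crawled URLs by country-code top-level domain as a first, admittedly unreliable guess at the national variety, and then inspect each bucket against what you already know about BrE, AmE, etc. The TLD-to-variety mapping below is purely illustrative, not a claim about how reliable any particular domain actually is.

# Sketch: group URLs by country-code TLD as a rough first guess at the
# national variety. The mapping is an illustrative assumption only.
from collections import defaultdict
from urllib.parse import urlparse

TLD_TO_VARIETY = {
    "uk": "BrE (tentative)",
    "us": "AmE (tentative)",
    "au": "AusE (tentative)",
    "nz": "NZE (tentative)",
    "sg": "SgE (tentative)",
}

def bucket_by_tld(urls):
    buckets = defaultdict(list)
    for url in urls:
        host = urlparse(url).hostname or ""
        tld = host.rsplit(".", 1)[-1].lower()
        buckets[TLD_TO_VARIETY.get(tld, "unknown - needs manual checking")].append(url)
    return buckets

# Generic TLDs such as .com tell you essentially nothing about the variety.
for variety, group in bucket_by_tld([
        "http://www.example.ac.uk/essay.html",
        "http://blog.example.com.sg/post/12",
        "http://www.example.com/page",
]).items():
    print(variety, len(group))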

Regards, Roland

On 28.11.2012 11:16, corpora-request at uib.no wrote:
> Message: 3
> Date: Tue, 27 Nov 2012 14:34:10 +0000
> From: Mark Davies <Mark_Davies at byu.edu>
> Subject: [Corpora-List] Google "region"-based searches
> To: "corpora at hd.uib.no" <corpora at hd.uib.no>
>
> I'm looking at creating a corpus based on the web pages from a particular country, and I'd like to use Google's advanced search "region" field to limit the pages (https://www.google.com/advanced_search, see http://www.googleguide.com/sharpening_queries.html#region). Supposedly, this limits pages based on IP address, rather than just TLD (such as .sg or .sk).
>
> Has anyone heard how accurate this region field is? I'm wondering because I'm seeing links to (for example) *.blogspot.com in region-based searches for countries other than the US (e.g. Singapore or Sri Lanka). For Google to be accurate in these cases, presumably there are servers for blogspot.com (or any other domain) in these other countries, blogs created by people from those countries are stored on servers in those countries, and Google is recognizing their location by IP address rather than just the domain. And the same would hold true for any US- or UK-based domain that returns results from other countries.
>
> Thanks in advance,
>
> Mark Davies



