[Corpora-List] Google "region"-based searches

Roland Schäfer roland.schaefer at fu-berlin.de
Wed Nov 28 12:48:59 CET 2012


I totally agree with what Adam Kilgarriff said: The problem is that nobody would want to do research on the accuracy of any Google feature, because they can change their algorithms at any moment without notice and without documentation. Results are fundamentally invalid even before you produce them.

By the way, if you want to use only Google results without crawling (BootCaT approach), you will have to pay substantial amounts of money, because they don't allow free API bulk requests anymore.

Whatever Google use: IP-based geolocation is totally unreliable as far as language varieties are concerned. If you find a document from a server located in Liverpool, are you going to treat the document as necessarily (or even potentially) containing Scouse features? Also, servers deliver different content based on undocumented mixes of various headers sent by the requester, requester IP geolocation, etc. Thus, a server located in London may deliver specialized content for US visitors, potentially written for US visitors by US authors. Google or any geolocator might even have classified the region of origin for some document correctly, but your crawler gets a different redirect to a different IP address. Automatic methods for large amounts of data will most likely never deliver reliable region identification.

If you want to deal with regional varieties in web corpora, I think two possible routes are: (1) Go for a small gold standard web corpus and try to figure out the variety spoken by the writers manually for each document. (2) Do more or less unselective crawls in the English-speaking web and then see whether the documents you get look like what you already know to be BrE and AmE, etc. Actually, top-level domains might turn out to be sort of reliable in some cases (with an accent on "sort of"), cf., e.g.:

@INPROCEEDINGS{Cook-Hirst2012,
  author    = {Cook, Paul and Hirst, Graeme},
  title     = {Do Web-Corpora from Top-Level Domains Represent National Varieties of {English}?},
  booktitle = {Proceedings of the 11th International Conference on the Statistical Analysis of Textual Data},
  year      = {2012},
  address   = {Liège},
}
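To make route (2) a bit more concrete, here is a rough sketch of how one might check whether a crawled document "looks like" BrE or AmE and compare that against its top-level domain. Everything in it is an assumption for illustration: the spelling list is a tiny hand-picked sample rather than a vetted resource, the function names and the `min_hits` threshold are made up, and spelling alone is a crude signal (Oxford spelling uses "-ize" in BrE, for instance). A real study would need far richer features and manual validation.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical, hand-picked (BrE, AmE) spelling pairs; for illustration only.
VARIANT_PAIRS = [
    ("colour", "color"),
    ("centre", "center"),
    ("organise", "organize"),
    ("favourite", "favorite"),
    ("analyse", "analyze"),
]
BRE_FORMS = {bre for bre, _ in VARIANT_PAIRS}
AME_FORMS = {ame for _, ame in VARIANT_PAIRS}


def variety_counts(text):
    """Count tokens that match the BrE or AmE spelling lists."""
    counts = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in BRE_FORMS:
            counts["BrE"] += 1
        elif token in AME_FORMS:
            counts["AmE"] += 1
    return counts


def guess_variety(text, min_hits=2):
    """Return 'BrE', 'AmE', or 'unknown' when evidence is thin or tied."""
    counts = variety_counts(text)
    total = counts["BrE"] + counts["AmE"]
    if total < min_hits or counts["BrE"] == counts["AmE"]:
        return "unknown"
    return "BrE" if counts["BrE"] > counts["AmE"] else "AmE"


def top_level_domain(url):
    """Return the last label of the host name, e.g. 'uk' for example.co.uk."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1] if "." in host else ""


if __name__ == "__main__":
    doc = "The colour of the city centre is my favourite."
    url = "http://www.example.co.uk/page"
    print(top_level_domain(url), guess_variety(doc))
```

Cross-tabulating `top_level_domain` against `guess_variety` over a whole crawl would give a first impression of how far TLDs and textual evidence agree, which is the kind of question the paper cited above addresses with proper methods.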

Regards, Roland

28.11.2012 11:16, corpora-request at uib.no wrote:
> Message: 3
> Date: Tue, 27 Nov 2012 14:34:10 +0000
> From: Mark Davies <Mark_Davies at byu.edu>
> Subject: [Corpora-List] Google "region"-based searches
> To: "corpora at hd.uib.no" <corpora at hd.uib.no>
>
> I'm looking at creating a corpus based on the web pages from a particular country, and I'd like to use Google's advanced search "region" field to limit the pages (https://www.google.com/advanced_search, see http://www.googleguide.com/sharpening_queries.html#region). Supposedly, this limits pages based on IP address, rather than just TLD (such as .sg or .sk).
>
> Has anyone heard how accurate this region field is? I'm wondering, because I'm seeing links to (for example) *.blogspot.com for region-based searches from countries other than the US (e.g. Singapore or Sri Lanka). In order for Google to be accurate in these cases, presumably there are servers for blogspot.com in these other countries (or any other domain), and as people from those countries create blogs they are stored on servers in that country, and then Google is recognizing their location by IP address, rather than just the domain. And the same would hold true for any US or UK-based domain that would return results from other countries.
>
> Thanks in advance,
>
> Mark Davies


