That's my sense as well. There seem to be pages that are on domains/servers in the US or UK, but which deal with country X, and some of these are getting tagged as being from country X itself. But I need to look into this more.
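One rough way to look into this, as a minimal sketch only: take the URLs returned by a region-restricted search and separate those whose host ends in the matching country-code TLD from those that do not (e.g. *.blogspot.com), so the second group can be inspected by hand. The function names `cc_tld` and `split_by_tld` are hypothetical, and the check assumes the region code equals the two-letter TLD, which does not hold everywhere (the UK uses .uk, for instance).

```python
from urllib.parse import urlparse

def cc_tld(url):
    """Return the two-letter final host label (e.g. 'sg'), or None otherwise."""
    host = (urlparse(url).hostname or "").rstrip(".")
    last = host.rsplit(".", 1)[-1].lower()
    # Assumption: a two-letter alphabetic final label is a country-code TLD.
    return last if len(last) == 2 and last.isalpha() else None

def split_by_tld(urls, region):
    """Split URLs into (ccTLD matches region, everything else)."""
    matching, other = [], []
    for url in urls:
        (matching if cc_tld(url) == region.lower() else other).append(url)
    return matching, other
```

The "other" list is where pages hosted on US or UK domains but tagged as country X would show up; this says nothing about how Google assigns the region, it only isolates the cases worth checking.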
============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================
________________________________________
From: John D. Burger [john at mitre.org]
Sent: Tuesday, November 27, 2012 12:21 PM
To: Mark Davies
Cc: corpora at hd.uib.no
Subject: Re: [Corpora-List] Google "region"-based searches
Mark Davies wrote:
> I'm looking at creating a corpus based on the web pages from a particular country, and I'd like to use Google's advanced search "region" field to limit the pages (https://www.google.com/advanced_search, see http://www.googleguide.com/sharpening_queries.html#region). Supposedly, this limits pages based on IP address, rather than just TLD (such as .sg or .sk).
> Has anyone heard how accurate this region field is? I'm wondering, because I'm seeing links to (for example) *.blogspot.com for region-based searches from countries other than the US (e.g. Singapore or Sri Lanka). In order for Google to be accurate in these cases, presumably there are servers for blogspot.com in these other countries (or any other domain), and as people from those countries create blogs they are stored on servers in that country, and then Google is recognizing their location by IP address, rather than just the domain. And the same would hold true for any US or UK-based domain that would return results from other countries.
I wouldn't assume Google is only using IP-based geolocation. Webmasters can provide region metadata to Google out-of-band:
Also, and this is total speculation, but I wouldn't be surprised if they're running a probabilistic classifier on websites to impute this information.
- John Burger