[Corpora-List] Google "region"-based searches

John D. Burger john at mitre.org
Tue Nov 27 20:57:44 CET 2012


Another thing that occurred to me: Mark mentioned as an example some blogspot.com sites. Blogger/blogspot has rich profiles that bloggers can fill out, including associated country, and Google may special-case these sites with that metadata.

This seems especially plausible given that Blogger is, in fact, a Google property.

- John Burger

MITRE

On Nov 27, 2012, at 14:44, Mark Davies wrote:


>>> Also, and this is total speculation, but I wouldn't be surprised if they're running a probabilistic classifier on websites to impute this information.
>
> That's my sense as well. There seem to be pages that are on domains/servers in the US or UK, but which deal with country X, and some of these are getting tagged as being from country X itself. But I need to look into this more.
>
> MD
>
> ============================================
> Mark Davies
> Professor of Linguistics / Brigham Young University
> http://davies-linguistics.byu.edu/
>
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
> ============================================
>
> ________________________________________
> From: John D. Burger [john at mitre.org]
> Sent: Tuesday, November 27, 2012 12:21 PM
> To: Mark Davies
> Cc: corpora at hd.uib.no
> Subject: Re: [Corpora-List] Google "region"-based searches
>
> Mark Davies wrote:
>
>> I'm looking at creating a corpus based on the web pages from a particular country, and I'd like to use Google's advanced search "region" field to limit the pages (https://www.google.com/advanced_search, see http://www.googleguide.com/sharpening_queries.html#region). Supposedly, this limits pages based on IP address, rather than just TLD (such as .sg or .sk).
>>
>> Has anyone heard how accurate this region field is? I'm wondering, because I'm seeing links to (for example) *.blogspot.com for region-based searches from countries other than the US (e.g. Singapore or Sri Lanka). In order for Google to be accurate in these cases, presumably there are servers for blogspot.com in these other countries (or any other domain), and as people from those countries create blogs they are stored on servers in that country, and then Google is recognizing their location by IP address, rather than just the domain. And the same would hold true for any US or UK-based domain that would return results from other countries.
>
> I wouldn't assume Google is only using IP-based geolocation. Webmasters can provide region metadata to Google out-of-band:
>
> http://googlewebmastercentral.blogspot.com/2009/12/region-tags-in-google-search-results.html
>
> Also, and this is total speculation, but I wouldn't be surprised if they're running a probabilistic classifier on websites to impute this information.
>
> - John Burger
> MITRE


