[Corpora-List] Google "region"-based searches

egon w. stemle egon.stemle at unitn.it
Wed Nov 28 15:17:20 CET 2012


...not sure what you want to use the 'regional' feature for but - i might have an idea, and then - the following work might be of interest:

http://dl.acm.org/citation.cfm?id=2140536 Paddy WaC: a minimally-supervised web-corpus of Hiberno-English

http://www.cs.toronto.edu/~pcook/CookHirst2012.pdf Do Web Corpora from Top-Level Domains Represent National Varieties of English?

http://sigwac.org.uk/raw-attachment/wiki/WAC7/wac7-proc.pdf#page=31 Using Web Corpora for the Recognition of Regional Variation in Standard German Collocations

...and there is an upcoming/unpublished work about: StirWac - Compiling a web-based diverse corpus for South Tyrolean German considering genre

""" ...describe how we compiled a web-based corpus for South Tyrolean German and afterwards proceeded with a method trying to make it more diverse.

During the compilation of the corpus we had to face the problem that the variety of our specialized corpus is not limited to one top-level domain on the internet. Therefore we had to develop new strategies to narrow down our area of search. We based our work on the BootCaT tool by Baroni and Bernardini (2004) and additionally used Nutch, the web crawler developed by Apache. After the compilation of the corpus we analysed its document distribution with a method suggested by Serge Sharoff and tried to strengthen the 'weakly represented areas' with similar documents, again gathered from the internet. """
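to illustrate the top-level-domain problem mentioned in the abstract (and in the question below about *.blogspot.com), here is a minimal sketch, not taken from any of the papers above. it splits a URL list by top-level domain and shows which pages a TLD-only filter would miss. the function names and the sample URLs are made up for illustration:

```python
from urllib.parse import urlparse


def tld_of(url):
    """Return the last dot-separated label of the URL's host, lowercased.

    Returns an empty string when the host has no dot (e.g. 'localhost').
    """
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower() if "." in host else ""


def split_by_tld(urls, wanted_tlds):
    """Split urls into (kept, missed) by whether their TLD is in wanted_tlds."""
    kept, missed = [], []
    for url in urls:
        (kept if tld_of(url) in wanted_tlds else missed).append(url)
    return kept, missed


# Hypothetical sample: one page under a country-code TLD and one blog
# hosted on a generic domain that may well be written in the same country.
sample_urls = [
    "http://news.example.sg/article",
    "http://someblog.blogspot.com/2012/11/post.html",
]

kept, missed = split_by_tld(sample_urls, {"sg"})
print("kept:", kept)
print("missed:", missed)
```

a filter like this keeps only the .sg page; the blog under .com ends up in `missed` no matter where its author lives, which is why a TLD alone is a weak proxy for region.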

if you find anything interesting, i'm happy to go into more details (at least with my own work...) -e.

On 2012-11-28 11:16, corpora-request at uib.no wrote:
> Date: Tue, 27 Nov 2012 14:34:10 +0000
> From: Mark Davies <Mark_Davies at byu.edu>
> Subject: [Corpora-List] Google "region"-based searches
> To: "corpora at hd.uib.no" <corpora at hd.uib.no>
>
> I'm looking at creating a corpus based on the web pages from a particular country, and I'd like to use Google's advanced search "region" field to limit the pages (https://www.google.com/advanced_search, see http://www.googleguide.com/sharpening_queries.html#region). Supposedly, this limits pages based on IP address, rather than just TLD (such as .sg or .sk).
>
> Has anyone heard how accurate this region field is? I'm wondering, because I'm seeing links to (for example) *.blogspot.com for region-based searches from countries other than the US (e.g. Singapore or Sri Lanka). In order for Google to be accurate in these cases, presumably there are servers for blogspot.com in these other countries (or any other domain), and as people from those countries create blogs they are stored on servers in that country, and then Google is recognizing their location by IP address, rather than just the domain. And the same would hold true for any US or UK-based domain that would return results from other countries.
>
> Thanks in advance,
>
> Mark Davies
>
> ============================================
> Mark Davies
> Professor of Linguistics / Brigham Young University
> http://davies-linguistics.byu.edu/
>
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
> ============================================


