>> For example, given a large enough random sample of English texts written by people whose IPs resolve to Ireland, could we not reasonably expect the distribution of language varieties in those texts to roughly match that of the Irish population in general, or at least that portion of it which is online?

I should have a 2-3 billion word corpus of English from about 20 different countries up and running in a couple of months. It's based on Google "region-based" queries (as per my original post). Once it's done, I'll look at some linguistic features where we know that a word or phrase X is much more frequent in country Y than in other countries, and see how well the region-based searches worked. I'll try to remember to report back to CORPORA to let others know how it worked.
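A rough sketch of that kind of check, in Python: normalize the count of X to a per-million rate in each country's sub-corpus, then compare country Y's rate with the pooled rate of all the other countries. The function names, country codes, counts, and corpus sizes here are made-up placeholders for illustration, not results from the corpus.

```python
def per_million(count, total_words):
    """Normalized frequency: occurrences per million words."""
    return 1_000_000 * count / total_words if total_words else 0.0


def region_ratio(counts, sizes, target):
    """Ratio of the per-million rate in `target` to the pooled rate elsewhere.

    counts: hits for the word or phrase, keyed by country code
    sizes:  sub-corpus size in words, keyed by country code
    """
    target_rate = per_million(counts.get(target, 0), sizes[target])
    other_count = sum(c for k, c in counts.items() if k != target)
    other_size = sum(s for k, s in sizes.items() if k != target)
    other_rate = per_million(other_count, other_size)
    if other_rate == 0:
        return float("inf")
    return target_rate / other_rate


# Hypothetical numbers only: a phrase expected to be typical of country "IE".
counts = {"IE": 50, "US": 10, "GB": 10}
sizes = {"IE": 1_000_000, "US": 5_000_000, "GB": 5_000_000}
print(region_ratio(counts, sizes, "IE"))
```

A ratio well above 1 for features already known to be country-specific would suggest the region-based pages really do come from that country; a ratio near 1 would suggest they are heavily mixed.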
The other option, of course, is to use the TLD (.lk for Sri Lanka, .sg for Singapore, .tz for Tanzania, etc.), but limiting it this way *really* seems to degrade the "quality" of the web pages returned. It's not as bad as limiting US-based pages to .us -- where you get a lot of boring state and local government web pages -- but still not ideal. E.g., try limiting results for Tanzania to .tz or Sri Lanka to .lk -- my impression is that only a small percentage of all pages from those countries have that TLD, and those pages may not be representative of the whole.
So while geolocation certainly isn't perfect, it doesn't look like a strictly TLD-based approach would be either.
Anyway, I'll report back on what I find with the Google region-based searches.
============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================
________________________________________
From: corpora-bounces at uib.no [corpora-bounces at uib.no] on behalf of Tristan Miller [miller at ukp.informatik.tu-darmstadt.de]
Sent: Wednesday, November 28, 2012 5:56 AM
To: Corpora List
Subject: Re: [Corpora-List] Google "region"-based searches

On 28/11/12 01:25 PM, Trevor Jenkins wrote:
> On 28 Nov 2012, at 11:48, Roland Schäfer <roland.schaefer at fu-berlin.de> wrote:
>> Whatever Google use: IP-based geolocation is totally unreliable as far
>> as language varieties are concerned.
> Definitely. My current ISP has various nodes connecting to the Internet.
> My connections appear to be in either Bangor in north Wales or in
> Winchester in southern England but never where I'm actually located.
I don't think you can use single cases like this to make blanket statements about the "total unreliability" of geolocation. Sure, the user of any one IP can't be pinpointed with certainty to the nearest square centimetre, but neither is geolocation totally random. Were we to analyze a large enough sample of geolocations, we could probably conclude that m% of all IPs can be correctly resolved geographically to within an n-kilometre radius.

For large enough areas (say, entire countries) the accuracy of geolocation may be high enough for one's purposes to make some informed estimates on the distribution of coarse-grained language varieties. For example, given a large enough random sample of English texts written by people whose IPs resolve to Ireland, could we not reasonably expect the distribution of language varieties in those texts to roughly match that of the Irish population in general, or at least that portion of it which is online?
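
To make the "m% within n kilometres" idea concrete, here is a minimal Python sketch. It assumes we have a sample of IPs for which both the geolocated position and the true position are known; the coordinates below are approximate, invented pairings for illustration, not measurements, and the function names are hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    earth_radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def percent_within(pairs, n_km):
    """Percentage (m) of (estimated, true) position pairs at most n_km apart."""
    if not pairs:
        raise ValueError("need at least one (estimated, true) pair")
    hits = sum(1 for est, true in pairs if haversine_km(*est, *true) <= n_km)
    return 100.0 * hits / len(pairs)


# Hypothetical sample: (estimated (lat, lon), true (lat, lon)).
pairs = [
    ((53.35, -6.26), (53.35, -6.26)),   # resolved exactly
    ((53.35, -6.26), (51.90, -8.47)),   # wrong city, same country
    ((53.35, -6.26), (40.71, -74.01)),  # wrong continent
]
for n_km in (50, 500):
    print(n_km, percent_within(pairs, n_km))
```

For country-level work one would replace the radius test with a simple "same country?" comparison, which is a much easier target to hit.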
--
Tristan Miller, Doctoral Researcher
Ubiquitous Knowledge Processing Lab (UKP-TUDA)
Department of Computer Science, Technische Universität Darmstadt
Tel: +49 6151 16 6166 | Web: http://www.ukp.tu-darmstadt.de/