[Corpora-List] Google "region"-based searches

Trevor Jenkins trevor.jenkins at suneidesis.com
Wed Nov 28 14:34:35 CET 2012


On 28 Nov 2012, at 12:56, Tristan Miller <miller at ukp.informatik.tu-darmstadt.de> wrote:


> On 28/11/12 01:25 PM, Trevor Jenkins wrote:
>> On 28 Nov 2012, at 11:48, Roland Schäfer <roland.schaefer at fu-berlin.de> wrote:
>>
>>> Whatever Google use: IP-based geolocation is totally unreliable as far
>>> as language varieties are concerned.
>>
>> Definitely. My current ISP has various nodes connecting to the Internet.
>> My connections appear to be in either Bangor in north Wales or in
>> Winchester in southern England but never where I'm actually located.
>
> I don't think you can use single cases like this to make blanket
> statements about the "total unreliability" of geolocation. Sure, the
> user of any one IP can't be pinpointed with certainty to the nearest
> square centimetre, but neither is geolocation totally random. Were we
> to analyze a large enough sample of geolocations, we could probably
> conclude that m% of all IPs can be correctly resolved geographically to
> within a n-kilometre radius.

My evidence is anecdotal for sure, but I'm not using family-run ISPs here. My suppliers are big boys in the market (BT, T-Mobile, Hutchison), whether for my fixed-line provision or my various mobile connections; if they give false locations for me then they will give false locations for many, many others too. Indeed, the fact that I'm an Englishman sitting in a building in England while my ISP puts my data out on to the Internet from a German IMP is highly misleading.

From the analysis of my own location based on the IP addresses provided by these major providers, your n (as in n-kilometre radius) has values between 5 and 563 for my personal connections (and the 563 traverses five countries: west Germany, Holland, Belgium, France and the UK), and the n for my prior employment with DEC/Compaq/HP is 5,280! Those are NOT insignificant values.
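
To make the m and n under discussion concrete, here is a minimal sketch in Python of how one might measure them, assuming one had pairs of (true position, position reported by geolocation) for a sample of IPs. The function names and the coordinates in SAMPLE are hypothetical placeholders for illustration only; they are not my actual measurements.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def percent_within(pairs, n_km):
    """Percentage m of (true, reported) pairs whose geolocation error is at most n_km."""
    if not pairs:
        raise ValueError("need at least one (true, reported) pair")
    hits = sum(1 for true, reported in pairs if haversine_km(*true, *reported) <= n_km)
    return 100.0 * hits / len(pairs)


# Made-up sample: each entry is ((true lat, true lon), (reported lat, reported lon)).
SAMPLE = [
    ((51.5, -0.1), (51.5, -0.1)),    # reported at the true position
    ((51.5, -0.1), (51.06, -1.31)),  # reported roughly 100 km away
    ((51.5, -0.1), (50.11, 8.68)),   # reported in another country
]

for n in (10, 150, 1000):
    print(f"n = {n} km: m = {percent_within(SAMPLE, n):.1f}%")
```

The point of the sketch is only that m depends heavily on the n one is prepared to tolerate, and that measuring either requires ground truth about where the users really are.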


> For large enough areas (say, entire
> countries) the accuracy of geolocation may be high enough for one's
> purposes to make some informed estimates on the distribution of
> coarse-grained language varieties.

Entire "countries"! There are entire continents being traversed here.


> For example, given a large enough
> random sample of English texts written by people whose IPs resolve to
> Ireland, could we not reasonably expect the distribution of language
> varieties in those texts to roughly match that of the Irish population
> in general, or at least that portion of it which is online?

No! Because the assumption that RIPE and its intercontinental cohorts allocate IPv4 addresses on a strictly geographic basis is fundamentally untrue. You could not, for example, make any statements about my English use when my IP address is declared as being located in Wales. Even though I have a static IP address, you still can't make any assumption about it: it might appear to be in Wales one second; then, because of ISP intranet routing issues, appear to be in Scotland the next; then further routing issues cause it to appear to be not in Bangor, Wales but in Bangor, Northern Ireland, before it is re-routed through an IMP in Winchester, England. Neither can you make any statements about my Swedish language usage when my mobile IP address is declared as being located in Germany.

And we haven't even begun to consider web hosting issues and the implications of DNS domains. For example, what checks would you make of the location of David Lee's Corpora bookmarks site, originally at devoted.to and now said to be at tiny.cc, with .to and .cc being the TLDs for Tonga and the Cocos Islands respectively?
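
As a rough illustration of how badly a TLD-based check would go, here is a small Python sketch. The lookup table is a tiny hand-written assumption, not a complete ccTLD registry, and the function name is hypothetical.

```python
# Illustrative only: a few hand-picked country-code TLDs, not a full registry.
CCTLD_COUNTRY = {
    "to": "Tonga",
    "cc": "Cocos (Keeling) Islands",
    "uk": "United Kingdom",
    "de": "Germany",
}


def naive_country_from_hostname(hostname):
    """Guess a country from the last DNS label; returns None for unknown or generic TLDs."""
    tld = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    return CCTLD_COUNTRY.get(tld)


for host in ("devoted.to", "tiny.cc"):
    print(host, "->", naive_country_from_hostname(host))
```

Such a check would place the bookmarks site first in Tonga and then in the Cocos Islands, which says nothing about where the site or its author actually is.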

Regards, Trevor.

<>< Re: deemed!


