[Corpora-List] Google "region"-based searches

Mike Maxwell maxwell at umiacs.umd.edu
Wed Nov 28 23:58:51 CET 2012


On 11/28/2012 8:40 AM, Mark Davies wrote:
> The other option, of course, is to use TLD (.lk for Sri Lanka, .sg for Singapore, .tz for
> Tanzania, etc), but limiting it this way *really* seems to degrade the "quality" of the web pages
> returned. Not as bad as if one were to limit US-based pages to .us -- where you get a lot of
> boring state and local government web pages -- but still not ideal. E.g., try limiting results
> for Tanzania to .tz or Sri Lanka to .lk -- my impression is that only a small percentage of all
> pages from that country have that TLD, and those pages may not be representative of the whole.

This is familiar. Eight or so years ago, we were looking for Tagalog pages, and briefly thought about using country-code TLDs to confine our searches to the Philippines. Both precision and recall were terrible: precision because the Philippines has lots of English-language websites (not to mention sites in Philippine languages other than Tagalog), and recall for the reason given above, namely that only a small percentage of a country's pages carry its TLD.
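To make the precision/recall point concrete, here is a minimal sketch in Python. It treats "host ends in the country's TLD" as a filter for "page is in the target language" and scores that filter against pages whose language is already known. The function name `tld_filter_metrics`, the URLs, and the language labels are all invented for illustration; this is not the data or method from our project.

```python
from urllib.parse import urlparse


def tld_filter_metrics(pages, tld, target_lang):
    """Score 'host ends in .tld' as a proxy for 'page is in target_lang'.

    pages: iterable of (url, language_label) pairs with known labels.
    Returns (precision, recall).
    """
    suffix = "." + tld.lstrip(".").lower()
    selected = relevant = hits = 0
    for url, lang in pages:
        host = (urlparse(url).hostname or "").lower()
        in_tld = host.endswith(suffix)      # page passes the TLD filter
        is_target = lang == target_lang     # page is actually what we want
        selected += in_tld
        relevant += is_target
        hits += in_tld and is_target
    precision = hits / selected if selected else 0.0
    recall = hits / relevant if relevant else 0.0
    return precision, recall


# Invented toy sample (hypothetical URLs and labels, not real data).
# Labels: "tl" = Tagalog, "ceb" = Cebuano, "en" = English.
sample = [
    ("http://news.example.ph/a", "en"),
    ("http://blog.example.ph/b", "tl"),
    ("http://gov.example.ph/c", "en"),
    ("http://radio.example.ph/d", "ceb"),
    ("http://forum.example.com/e", "tl"),
    ("http://diaspora.example.org/f", "tl"),
    ("http://shop.example.com/g", "en"),
]

precision, recall = tld_filter_metrics(sample, "ph", "tl")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up sample, most .ph pages are not in Tagalog (low precision), and most Tagalog pages sit outside .ph (low recall), which is the pattern described above.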

I wrote some of this up in a paper given at ALLC/ACH in 2004. However, it was about finding web pages in non-English languages, so the methods probably wouldn't help if you're looking for dialectal English.

--

Mike Maxwell

maxwell at umiacs.umd.edu

"My definition of an interesting universe is
one that has the capacity to study itself."
--Stephen Eastmond


