[Corpora-List] Google "region"-based searches

Trevor Jenkins trevor.jenkins at suneidesis.com
Wed Nov 28 12:07:38 CET 2012


On 28 Nov 2012, at 10:15, Adam Kilgarriff <adam at lexmasterclass.com> wrote:


> Googleology is bad science.

How do we break that fallacy? In one of my other guises (as a translator) I regularly encounter people justifying their lexical choices solely on Google hit counts. They don't consider context at all; they just report that the term they have specious ideological reasons for selecting "gets X hits", while the term they don't want to use, for equally specious reasons, "only gets Y hits", where X > Y.

I know you (Adam) have written about this idiocy, and my reaction to such "X > Y" nonsense is to refer people to your 2006 ACL paper. Sadly they remain unconvinced because of their own presuppositions. They continue to argue on the basis of Google hit counts.


> Being at the mercy of every slight change in syntax or interpretation of Google's unpublished, undocumented search syntax is horrible. We need to move to more robust, less dependent approaches. If you have a web-scale corpus on your machine, you don't need Google. ...

I'd just make do with a web-scale alternative to Google that had formally documented search syntax. I'm not sure how well Lucene/Solr or one of the other open source search engines would scale with a database of the volume of Google's spidered content. And who has the cash to create a disk/server farm to match Google's? Perhaps there's a need for a research grant to found a project to determine the scalability of Lucene/Solr and other products with a similarly sized spidered dataset.

Regards, Trevor.

<>< Re: deemed!


