[Corpora-List] Query on the use of Google for corpus research

Tom Emerson tree at basistech.com
Mon May 30 21:45:01 CEST 2005

Dominic Widdows writes:

> Is there good reliable software out there, for those who would still be
> fearful of hacking up a harvester for themselves?
> There is the Internet Archive's Heritrix crawler
> (http://crawler.archive.org/). Has anyone used this and found it
> suitable for linguistic purposes?

Yes, I use it for large-scale crawls for linguistic research, and will
be presenting some of my work at the "Web as Corpus" workshop being
held with Corpus Linguistics 2005. Heritrix is an outstanding piece of

> This still leaves some of the traditional benefits of corpora
> unaccounted for - what about normalising the text content (presuming
> the traditional notion that text content is the linguistics phenomenon
> you're interested in), tagging, perhaps getting all the data into the
> same character set, etc.? These are some of the creature comforts that
> organizations such as the LDC have traditionally provided. We can

And these are the dirty little details that most researchers just wave
off with a swish of their hand. When it comes down to it, crawling
data is only a small part of the problem.
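
To make the character-set point concrete, here is a minimal sketch of
one such detail: bringing crawled pages into a single encoding and a
single Unicode normal form. It is only an illustration under stated
assumptions. The function name, the order of fallback encodings, and
the choice of NFC are all hypothetical, and none of this is part of
Heritrix or of any particular corpus pipeline.

```python
import unicodedata
from typing import Optional


def to_normalized_text(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode crawled bytes and normalise them to NFC Unicode text.

    `declared` is whatever charset the page or HTTP header claimed
    (assumed to be supplied by the caller); it may be wrong or missing.
    """
    # Assumed fallback order: the declared charset first, then UTF-8,
    # then Windows-1252, a common encoding for Western European pages.
    candidates = ([declared] if declared else []) + ["utf-8", "cp1252"]

    for encoding in candidates:
        try:
            text = raw.decode(encoding)
            break
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate.
            continue
    else:
        # Latin-1 maps every byte to a code point, so this cannot fail,
        # though the result may be wrong for non-Latin scripts.
        text = raw.decode("latin-1")

    # Put composed and decomposed characters into one canonical form.
    text = unicodedata.normalize("NFC", text)

    # Collapse runs of whitespace (including newlines) to single spaces.
    return " ".join(text.split())
```

A page that wrongly declares UTF-8 but is really Windows-1252 would be
handled with `to_normalized_text(page_bytes, "utf-8")`, which falls
through to the `cp1252` candidate. Real crawls need much more than
this (charset sniffing, markup stripping, language identification),
which is rather the point.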


Tom Emerson Basis Technology Corp.
Software Architect http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"
