[Corpora-List] Query on the use of Google for corpus research

Mark P. Line mark at polymathix.com
Mon May 30 16:47:00 CEST 2005

Dominic Widdows said:


> The main problem is that "using the Web" on a large scale puts you at
> the mercy of the commercial search engines, which leads to the grim
> mess that Jean documents, especially with Google.

Actually, I don't think it's really true anymore that large-scale corpus
extraction from the Web necessarily puts you at the mercy of commercial
search engines. It's no longer very difficult to throw together a software
agent that will crawl the Web directly. (IOW: The indexing part of
commercial search engines may be rocket science, but the harvesting part
of them is not.)
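To make the point concrete, here's a minimal sketch of such a harvesting agent in Python, using only the standard library: a plain breadth-first crawler with a pluggable `fetch` callable so the harvesting logic isn't tied to any particular HTTP client. (The function names and the `max_pages` cap are my own illustrative choices, not part of any existing tool.)

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    `fetch` is any callable mapping a URL to its HTML text (e.g. a
    thin wrapper around urllib.request.urlopen); injecting it keeps
    the harvesting logic testable offline.
    Returns a {url: html} dict of every page harvested.
    """
    queue = deque(seeds)
    harvested = {}
    while queue and len(harvested) < max_pages:
        url = queue.popleft()
        if url in harvested:
            continue  # already visited
        try:
            html = fetch(url)
        except OSError:
            continue  # skip dead or unreachable links
        harvested[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            queue.append(urljoin(url, link))  # resolve relative links
    return harvested
```

A real agent would add politeness delays, robots.txt handling, and the filters mentioned below, but the core loop really is this small.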

> This situation may hopefully change as WebCorp
> (http://www.webcorp.org.uk/) teams up with a dedicated search
> engine. In the meantime, it's clearly true that you can get more
> results from the web, but you can't vouch for them properly, and so
> a community that values both recall and precision is left reeling.

I think that if you describe your harvesting procedure accurately (what
you seeded it with, and what filters you used if any), and monitor and
report on a variety of statistical parameters as your corpus is growing,
there's no reason why the resulting data wouldn't serve as an adequate
sample for many purposes -- assuming that's what you meant by "vouch for
them properly".
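By "statistical parameters" I mean things as simple as running token counts, vocabulary size, and type/token ratio, logged at checkpoints while the harvest runs. A toy sketch (the function name and the choice of statistics are mine, purely for illustration):

```python
def corpus_report(texts):
    """Report simple statistics for a growing corpus.

    For each document added, record a checkpoint of
    (docs seen, running token count, vocabulary size,
    type/token ratio), so the sampling behaviour can be
    documented alongside the seed list and filters.
    """
    tokens = 0
    vocab = set()
    checkpoints = []
    for i, text in enumerate(texts, 1):
        words = text.lower().split()  # crude tokenization for the sketch
        tokens += len(words)
        vocab.update(words)
        checkpoints.append((i, tokens, len(vocab), len(vocab) / tokens))
    return checkpoints
```

Publishing a trace like this along with the seed URLs and filter settings is what would let others judge whether the sample is adequate for their purposes.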

-- Mark

Mark P. Line
San Antonio, TX
