[Corpora-List] Query on the use of Google for corpus research

Tom Emerson tree at basistech.com
Mon May 30 21:58:00 CEST 2005


Mark P. Line writes:

> There's a protocol for robotic web crawlers that you should honor, whereby
> websites can specify how they wish such crawlers to behave when their site
> is encountered during a crawl. Other than that, I wouldn't worry too much
> about traffic caused by your harvesting. Kids build web mining
> applications in Java 101 these days. Heck, they're probably doing it in
> high school. *shrug*


This is, with all due respect, a very naive thing to say. If every
research group decided to unleash impolite crawlers on the world's
websites, I can guarantee that you would get a lot of hostile email very
quickly from the webmasters. Writing a useful crawler is a lot more
difficult than you let on, especially if you plan on crawling a
non-trivial number of sites. As far as traffic goes, one can easily
saturate a T3 line, bringing your local IT department down on you.
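
To make "polite" a little more concrete, here is a minimal Python sketch
of the two checks a crawler would make before each request: whether the
site's robots.txt allows the URL, and whether enough time has passed
since the last request to that host. The class name PolitenessGate, the
"corpusbot" user agent, and the one-second default delay are assumptions
made for illustration, not part of any existing crawler; only
urllib.robotparser comes from the standard library.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PolitenessGate:
    """Hypothetical helper: per-host robots.txt rules plus a minimum delay."""

    def __init__(self, user_agent, default_delay=1.0, clock=time.monotonic):
        self.user_agent = user_agent
        self.default_delay = default_delay  # assumed fallback, in seconds
        self.clock = clock                  # injectable so it can be tested
        self.rules = {}                     # host -> RobotFileParser
        self.next_ok = {}                   # host -> earliest next request time

    def load_rules(self, host, robots_txt):
        """Parse the text of a robots.txt file already fetched for `host`."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.rules[host] = parser

    def allowed(self, url):
        """True only if rules for the host are loaded and permit the URL."""
        parser = self.rules.get(urlsplit(url).netloc)
        return parser is not None and parser.can_fetch(self.user_agent, url)

    def wait_time(self, url):
        """Seconds to wait before the next request to this URL's host."""
        host = urlsplit(url).netloc
        return max(0.0, self.next_ok.get(host, 0.0) - self.clock())

    def record_fetch(self, url):
        """Note that a request was just made, honoring Crawl-delay if given."""
        host = urlsplit(url).netloc
        parser = self.rules.get(host)
        delay = parser.crawl_delay(self.user_agent) if parser else None
        self.next_ok[host] = self.clock() + (delay or self.default_delay)
```

A real crawler would call allowed() and sleep for wait_time() before
every fetch, then call record_fetch() afterwards. Even this leaves out
bandwidth limits, robots.txt caching and expiry, and error handling.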


> My take is that indexing can usefully be as (linguistically or otherwise)
> sophisticated as anybody cares and has the money to make it (once you've
> actually captured the text), whereas harvesting tends to gain little from
> anything but the most rudimentary filtering.


This is also rather naive. Let's say you start a crawl with 2300 seed
URLs. How deep into a site do you go? How do you deal with spider
traps? Do you follow links outside of the seed's site? How do you
prevent yourself from crawling the same content more than once? Or
what if you want to recrawl certain sites with some regularity? What
about sites that require login or cookies? How do you schedule the
URLs to be crawled? How do you store the millions of documents that
you download?
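
A few of these questions can be illustrated with a small Python sketch:
a breadth-first frontier with a depth limit, an optional "stay on the
seed's site" rule, a per-host page cap as a crude spider-trap guard, and
duplicate detection by content hash. Everything here is an assumption
made for illustration: the function name crawl, its parameters, and the
fetch callable (which is supposed to return a page's text and its
links). Recrawling, logins, cookies, scheduling policy, and storage at
scale are all left out, and those are where the real work is.

```python
import hashlib
from collections import deque
from urllib.parse import urldefrag, urljoin, urlsplit


def crawl(seeds, fetch, max_depth=3, max_pages_per_host=1000, stay_on_site=True):
    """Breadth-first crawl sketch.

    `fetch(url)` is a hypothetical callable returning (text, links).
    Returns a dict mapping each stored URL to its text.
    """
    frontier = deque((url, 0, urlsplit(url).netloc) for url in seeds)
    seen_urls = set(seeds)    # URLs already queued
    seen_content = set()      # hashes of page text already stored
    pages_per_host = {}       # host -> number of pages fetched
    stored = {}

    while frontier:
        url, depth, seed_host = frontier.popleft()
        host = urlsplit(url).netloc

        # Crude spider-trap guard: stop after a fixed number of pages per host.
        if pages_per_host.get(host, 0) >= max_pages_per_host:
            continue
        pages_per_host[host] = pages_per_host.get(host, 0) + 1

        text, links = fetch(url)

        # Skip content already seen under another URL (exact duplicates only).
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen_content:
            continue
        seen_content.add(digest)
        stored[url] = text

        if depth >= max_depth:
            continue
        for link in links:
            # Resolve relative links and drop the #fragment.
            target = urldefrag(urljoin(url, link)).url
            if stay_on_site and urlsplit(target).netloc != seed_host:
                continue
            if target not in seen_urls:
                seen_urls.add(target)
                frontier.append((target, depth + 1, seed_host))
    return stored
```

Note that the exact-hash check only catches byte-identical pages;
near-duplicates (same article, different navigation or timestamp) need
something more elaborate.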

In any event, I expect that the people behind Heritrix or UbiCrawler
or any of the other scalable, high-performance crawlers will disagree
with your glib dismissal of their area of expertise.

-tree

--
Tom Emerson Basis Technology Corp.
Software Architect http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"




