[Corpora-List] Query on the use of Google for corpus research

Marco Baroni baroni at sslmit.unibo.it
Tue May 31 10:51:01 CEST 2005



> > How do you deal with spider traps?
>
> Why would spider traps be a concern (apart from knowing to give up on the
> site if my IP address has been blocked by their spider trap) when all I'm
> doing is constructing a sample of text data from the Web?


First of all, your crawler has to understand that it fell into a trap.
Second, some spider traps generate dynamic pages containing random text
for you to follow -- now, that's a problem if you're trying to build a
linguistic corpus, isn't it?
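
To make the first point concrete, here is a minimal sketch of how a crawler
might notice that it is probably inside a trap. Everything in it is an
assumption for illustration only: the class name TrapGuard, the three
thresholds, and the heuristics themselves (very deep paths, one path segment
repeated many times, and a per-host page budget). It does nothing about the
second point, since randomly generated text would still have to be filtered
out of the corpus by other means.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical thresholds; sensible values depend on the crawl.
MAX_PAGES_PER_HOST = 500   # stop sampling a host after this many pages
MAX_PATH_DEPTH = 12        # unusually deep paths suggest generated links
MAX_SEGMENT_REPEATS = 3    # e.g. /a/a/a/a/... loops


class TrapGuard:
    """Heuristic check for URLs that look like part of a spider trap."""

    def __init__(self):
        self.pages_per_host = Counter()

    def looks_like_trap(self, url):
        parts = urlparse(url)
        segments = [s for s in parts.path.split("/") if s]
        # Heuristic 1: the path is suspiciously deep.
        if len(segments) > MAX_PATH_DEPTH:
            return True
        # Heuristic 2: the same path segment repeats too many times.
        if segments and max(Counter(segments).values()) > MAX_SEGMENT_REPEATS:
            return True
        # Heuristic 3: this host has already used up its page budget.
        if self.pages_per_host[parts.netloc] >= MAX_PAGES_PER_HOST:
            return True
        return False

    def record_fetch(self, url):
        # Call once for every page actually downloaded.
        self.pages_per_host[urlparse(url).netloc] += 1
```

A crawler would call looks_like_trap(url) before fetching and
record_fetch(url) afterwards. The per-host budget is the bluntest of the
three checks, but it also bounds the damage when a trap's URLs look
perfectly ordinary.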

Incidentally, a "spider trap" query on Google returns many more results
about crawlers, robots.txt files, etc. than about how to capture
eight-legged arachnids... a good example of why one should be careful
when using the web as a way to gather knowledge about the world...

Regards,

Marco
