[Corpora-List] Query on the use of Google for corpus research

Marco Baroni baroni at sslmit.unibo.it
Tue May 31 20:11:00 CEST 2005

> It's not much of a problem unless you presuppose that a corpus linguist
> would have difficulty finding a way to distinguish between a valid text in
> her target language and a random text generated by a spider trap.

Consider the following spider trap (quoted in the heritrix documentation):


It looks like it generates text from a unigram model, so I guess you could
use heuristics to find out that it's not true English text, e.g. using a
bigram model in some way (comparing the bigram entropy of a page with that
of a corpus of true English? Although then there is the risk of biasing
your crawl towards documents that look more like the ones in a corpus you
already have...), or using some kind of POS-pattern filter (which would
require POS tagging). Perhaps there are other heuristics that are simpler
and/or better (any suggestions?), but in any case this means that you have
to add yet another module to your corpus-crawling/processing architecture,
and if you happen to download a few gigabytes of data from sites like the
one above, things can get really annoying...
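To make the bigram idea concrete, here is a minimal sketch (mine, not part of the original post) of the first heuristic: score a page's tokens under add-alpha-smoothed bigram counts from a reference corpus of real English, and flag pages whose cross-entropy is far above what the reference corpus scores on itself. All function names, the smoothing constant, and the threshold are illustrative assumptions, not an established method.

```python
# Sketch: flag pages whose bigram statistics look nothing like real
# English -- e.g. text generated word-by-word from a unigram model,
# as a spider trap might produce. Names and thresholds are illustrative.
import math
from collections import Counter

def bigram_cross_entropy(tokens, ref_bigrams, ref_unigrams, alpha=0.5):
    """Average negative log2-probability of each token given its
    predecessor, under add-alpha-smoothed reference bigram counts."""
    vocab = len(ref_unigrams) + 1  # +1 slot for unseen tokens
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        num = ref_bigrams[(prev, cur)] + alpha
        den = ref_unigrams[prev] + alpha * vocab
        total += -math.log2(num / den)
    return total / max(1, len(tokens) - 1)

def looks_like_trap(page_tokens, ref_tokens, threshold=2.0):
    """Heuristic: suspicious if the page's bigram cross-entropy exceeds
    the reference corpus's own cross-entropy by more than `threshold`
    bits per token (threshold chosen arbitrarily here)."""
    ref_bigrams = Counter(zip(ref_tokens, ref_tokens[1:]))
    ref_unigrams = Counter(ref_tokens)
    ref_h = bigram_cross_entropy(ref_tokens, ref_bigrams, ref_unigrams)
    page_h = bigram_cross_entropy(page_tokens, ref_bigrams, ref_unigrams)
    return page_h - ref_h > threshold
```

Note that this sketch has exactly the bias mentioned above: it will also penalize genuine English pages whose bigrams simply differ from the reference corpus, so the threshold would need tuning on real crawl data.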

Moreover, as spammers are getting smarter all the time, anti-spammers are
also becoming more sophisticated -- suppose that somebody built a spider
trap by generating random _sentences_ instead of words: that would be
very hard to detect...

> > Incidentally, a "spider trap" query on google returns many more results
> > about crawlers, robots.txt files etc. than about how to capture
> > eight-legged arachnids... one good example of how one should be careful
> > when using the web as a way to gather knowledge about the world...


> I believe there's a huge difference between using the web as a way to
> gather knowledge about the world (especially if this is being done
> automatically) and using the web as a way to populate a corpus for
> linguistic research. The latter use is much less ambitious, and simply
> doesn't need to be weighed down by most of the concerns that web-mining or
> indexing applications do.

I agree that, as linguists, we do not need to worry if what we get does
not correspond to the "truth" about the outside world, but factors such as
the distribution of the senses of a word in our corpus should concern us.
For example, if I were to extract the semantics of the word "spider" from
a corpus, I would rather get the eight-legged-creepy-crawly-creature
reference as the central sense. In web-data, this could be tricky (of
course, I'm not saying that it would be impossible -- I'm just saying that
one should be a bit careful about what one can find in web-data...)
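As a toy illustration of the point (my own sketch, not something from the original post), one could estimate how a corpus splits the senses of "spider" by counting hand-picked cue words in a window around each occurrence. The cue lists and window size below are arbitrary assumptions chosen purely for demonstration.

```python
# Illustrative sketch: rough sense-distribution estimate for "spider"
# via hand-picked cue words in a small context window. The cue sets
# are toy assumptions, not a validated word-sense inventory.
from collections import Counter

ARACHNID_CUES = {"legs", "web", "bite", "spin", "spun"}
CRAWLER_CUES = {"robots.txt", "crawl", "url", "index", "http"}

def sense_counts(tokens, target="spider", window=5):
    """Count occurrences of `target` whose +/-window context contains
    cues for each sense (an occurrence may match both, or neither)."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        context = set(tokens[max(0, i - window):i + window + 1])
        if context & ARACHNID_CUES:
            counts["arachnid"] += 1
        if context & CRAWLER_CUES:
            counts["crawler"] += 1
    return counts
```

On web-data, one would expect the "crawler" count to dominate, which is exactly the skew worried about above.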

> Most corpus linguists who are constructing a dataset on the fly are just
> interested

I am surprised by how much you seem to know about what corpus linguists
do and like -- personally, I am not even sure I have understood who
qualifies as a corpus linguist, yet...

> and are usually willing to add or change samples indefinitely
> until their corpus has the characteristics they need.

In my experience, adding and changing samples indefinitely until I have
about 1 billion words of web-data with the characteristics I need turns
out to be a pretty difficult thing to do... if you can suggest a procedure
to do this in an easy way, I (and, I suspect, "most corpus linguists")
would be very grateful.


