[Corpora-List] Corpus Mining
baroni at sslmit.unibo.it
Wed Dec 8 08:38:01 CET 2004
The CorpusBuilder tool was the main inspiration for BootCaT:
It is intended for the collection of texts in a specific language,
rather than about a specific topic, but I suppose it could be tweaked
to look for specialized texts.
CorpusBuilder was (is?) part of a larger project about acquiring
knowledge from the web:
An Crúbadán is another tool for language-specific web-corpus mining
that could perhaps be tweaked for sub-language mining:
Also somewhat relevant is the notion of ``focused crawling'' in
information retrieval, see e.g.
University of Bologna