[Corpora-List] Corpora for EAP: Architecture...?

Marco Baroni baroni at sslmit.unibo.it
Mon Jan 16 14:35:00 CET 2006


Hi Eric.

For smallish specialized corpora, I suppose the following Python-based
solution would work, and it probably would not take more than one day to
implement...

- Write a script that generates random combinations of potentially relevant
terms from a list

- Use a Python module to retrieve web pages from Google via the API, e.g.,
http://pygoogle.sourceforge.net/, using each of the random combinations as
a query string

- Use the Python BTE module (http://www.smi.ucd.ie/hyppia/) to clean the
pages you retrieve (it's slower than our Perl implementation, but for small
corpora that should not be a problem).

- Use NLTK or other Python/Java tools to process the corpus constructed
in this way
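
As a rough illustration of the first step only, here is a minimal sketch
using nothing but the standard library. The function name random_queries,
its parameters, and the example term list are all hypothetical, chosen for
illustration; the retrieval and cleaning steps are left out because they
depend on the search API and the BTE module you end up using.

```python
import math
import random


def random_queries(terms, n_queries=10, terms_per_query=3, seed=None):
    """Return distinct random combinations of seed terms as query strings.

    All names here are illustrative assumptions, not part of any library.
    """
    rng = random.Random(seed)          # seedable, so a run can be repeated
    unique_terms = sorted(set(terms))  # drop duplicate seed terms
    if terms_per_query > len(unique_terms):
        raise ValueError("terms_per_query exceeds the number of distinct terms")

    # Never ask for more distinct combinations than actually exist.
    target = min(n_queries, math.comb(len(unique_terms), terms_per_query))

    seen = set()
    queries = []
    while len(queries) < target:
        # Sort each sample so that the same set of terms is only used once.
        combo = tuple(sorted(rng.sample(unique_terms, terms_per_query)))
        if combo not in seen:
            seen.add(combo)
            queries.append(" ".join(combo))
    return queries


if __name__ == "__main__":
    # Example seed terms; replace with terms relevant to your domain.
    seed_terms = ["abstract", "hypothesis", "methodology", "citation",
                  "dissertation", "seminar"]
    for query in random_queries(seed_terms, n_queries=5, seed=42):
        print(query)
```

Each string returned can then be passed as a query to whatever search
module you use for the second step.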

Regards,

Marco





