[Corpora-List] Getting articles from newspapers to compile a corpus

Valerio Basile v.basile at rug.nl
Thu Nov 29 21:47:17 CET 2012



>> I was wondering if anyone knows how to get every possible article
>> from online newspapers and magazines. I was thinking something like
>> giving a program the URL of the newspaper (e.g. www.eltiempo.com [1])

For the Groningen Meaning Bank we downloaded approx. five years' worth of articles from the American online newspaper Voice of America: http://www.voanews.com/ We used wget for it, but, as Mark pointed out, it is good practice to cap the rate at which you download data from their server. One reason we chose VoA is that its text is in the public domain, that is, everyone is free to redistribute it. This is worth looking for if you are building a corpus and want to distribute the raw data along with your annotations.
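
To illustrate the rate cap, here is a minimal Python sketch of a downloader that pauses between requests. It is not the setup we actually used (that was plain wget, which has its own --wait and --limit-rate options); the function names, the two-second default delay and the injectable fetch/sleep parameters are all assumptions made for the example.

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable


def fetch_url(url: str, timeout: float = 30.0) -> bytes:
    """Download one page using only the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def polite_download(
    urls: Iterable[str],
    delay_seconds: float = 2.0,  # assumed value; tune it to the site
    fetch: Callable[[str], bytes] = fetch_url,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, bytes]:
    """Fetch each URL in turn, pausing between consecutive requests."""
    pages: Dict[str, bytes] = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # this pause is the cap on the request rate
        try:
            pages[url] = fetch(url)
        except OSError as error:  # urllib's URLError is a subclass of OSError
            print(f"skipping {url}: {error}")
    return pages
```

You would call it as polite_download(list_of_article_urls) and write the returned pages to disk. The list of article URLs still has to come from somewhere, e.g. a sitemap or an archive index, and it is worth checking the site's robots.txt and terms of use first.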

What language/variety were you looking for?


