[Corpora-List] Getting articles from newspapers to compile a corpus

maxwell maxwell at umiacs.umd.edu
Thu Nov 29 21:02:50 CET 2012


On 2012-11-29 13:21, Matías Guzmán wrote:
> I was wondering if anyone knows how to get every possible article
> from online newspapers and magazines. I was thinking something like
> giving a program the URL of the newspaper (e.g. www.eltiempo.com)
> and getting the text from all pages therein. Is that possible?

As someone else mentioned, wget (which, last I looked, runs under Windows as well as Linux) is one way to do this, assuming the newspaper keeps its old issues archived online.
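
For what it's worth, here is a rough sketch of what such a crawl might look like when driven from a Python script. The wget options are real ones, but the depth, the two-second pause between requests, the output directory name "corpus_raw", and the helper names build_wget_command and mirror are all assumptions of mine, to be adjusted for the site in question.

```python
import shlex
import subprocess


def build_wget_command(start_url, out_dir="corpus_raw", depth=5, wait_seconds=2):
    """Return the argument list for a recursive wget crawl of one site.

    All default values are illustrative guesses, not recommendations.
    """
    return [
        "wget",
        "--recursive",                    # follow links found on each page
        f"--level={depth}",               # stop after this many link hops
        "--no-parent",                    # never climb above the start URL
        f"--wait={wait_seconds}",         # pause between successive requests
        "--random-wait",                  # vary that pause from request to request
        f"--directory-prefix={out_dir}",  # save everything under this directory
        start_url,
    ]


def mirror(start_url, dry_run=True):
    """Build the command; run it only if dry_run is False.

    Returns the command as a single shell-quoted string for inspection.
    """
    cmd = build_wget_command(start_url)
    if not dry_run:
        subprocess.run(cmd, check=True)  # needs wget installed and on the PATH
    return shlex.join(cmd)


if __name__ == "__main__":
    # Dry run: only prints the command that would be executed.
    print(mirror("http://www.eltiempo.com/"))
```

Calling mirror(url, dry_run=False) would start the actual download. What comes back is raw HTML, so extracting the article text is a separate step.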

There are of course cautions. Some sites will notice that you're vacuuming everything up and get suspicious, and they may block you, particularly if you just let wget run unthrottled. You'll also have cleanup to do, one aspect of which will be to check for duplicate (or near-duplicate) files. And of course there are potential copyright issues.
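
As an illustration of the duplicate check, here is a minimal sketch that compares documents by the word sequences they share ("shingles") and flags pairs whose Jaccard similarity is high. This is one common approach, not the only one. The shingle length, the 0.8 threshold, and the function names are assumptions, and it expects plain text that has already been extracted from the HTML.

```python
import re
from itertools import combinations


def shingles(text, k=5):
    """Return the set of k-word sequences (shingles) found in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short texts: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0  # two empty documents count as identical
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.8, k=5):
    """Yield (name1, name2, similarity) for pairs at or above threshold.

    docs maps a document name to its plain text. Every pair is compared,
    so the cost grows quadratically with the number of documents.
    """
    sets = {name: shingles(text, k) for name, text in docs.items()}
    for (n1, s1), (n2, s2) in combinations(sets.items(), 2):
        sim = jaccard(s1, s2)
        if sim >= threshold:
            yield n1, n2, sim
```

To use it, read each downloaded file into a dict keyed by filename and loop over near_duplicates(docs). For a large crawl the all-pairs comparison gets slow, and a hashing scheme such as MinHash would be the usual replacement.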

Mike Maxwell

University of Maryland


