[Corpora-List] Getting articles from newspapers to compile a corpus

Daniel Stein danielstein81 at gmail.com
Fri Nov 30 08:42:13 CET 2012

Dear Matías,

Another tool you could use to scrape newspaper pages is Scrapy, which is Python-based (http://scrapy.org/).
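
In case the difficulty is the cleanup step rather than the download, here is a minimal sketch of one way to pull the running text out of the HTML files that wget saved. It uses only Python's standard-library html.parser, not Scrapy or BeautifulSoup, and it assumes that the article text sits inside <p> elements, which will not hold for every newspaper site. The names ParagraphExtractor and extract_paragraphs are made up for illustration.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text inside <p> elements, ignoring <script> and <style>.

    Assumption: article text lives in <p> tags; navigation, menus and
    other page furniture outside <p> are dropped.
    """

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &ntilde; etc.
        self.paragraphs = []
        self._buffer = []
        self._in_p = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self._in_p = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and not self._skip_depth:
            self._buffer.append(data)

    def _flush(self):
        # Join the pieces of one paragraph and normalise whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def close(self):
        super().close()
        self._flush()


def extract_paragraphs(html):
    """Return the non-empty paragraph texts found in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

To use it on a downloaded page, read the file into a string with the encoding the site declares (often UTF-8 or ISO-8859-1 for Spanish newspapers) and pass the string to extract_paragraphs; the result is a list with one string per paragraph.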

With respect to oral corpora of Latin American Spanish, I can recommend the Hamburg Corpus of Argentinean Spanish (HaCASpa): http://www.corpora.uni-hamburg.de/sfb538/en_h9_hacaspa.html

Kind regards,
Daniel

2012/11/29 Matías Guzmán <mortem.dei at gmail.com>
> Thanks for all your answers :)
> I'm interested in Spanish. I already have a corpus of about 20 newspapers from Spain, and now I would like to compile corpora for a couple of countries in America. My project (it's for my MA thesis) tries to predict from the morphosyntactic and lexical features of a sentence whether it is a pro-drop construction (so yes, I'll be using stats). I have reason to believe that the rate of pro-drop varies from country to country. I already have oral corpora for Spain and Colombia, but finding them for other countries has proven really difficult. I thought that newspaper corpora could be a nice way of getting documents for many different countries.
> I already tried wget. It seems to work quite well, but I wasn't able to clean the HTML files it creates using BeautifulSoup for Python. Maybe somebody knows of other software capable of doing this?
> Matías
-- 
Daniel Stein
Universität Hamburg
Hamburger Zentrum für Sprachkorpora
<http://www.corpora.uni-hamburg.de/>
Max-Brauer-Allee 60
22765 Hamburg
Germany

