[Corpora-List] Getting articles from newspapers to compile a corpus

Daniel Stein danielstein81 at gmail.com
Fri Nov 30 08:42:13 CET 2012


Dear Matías,

Another tool you could use to scrape newspaper pages is Scrapy, which is Python-based (http://scrapy.org/).
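As a rough illustration of the HTML-cleaning step asked about in the quoted message, the sketch below pulls paragraph text out of a page that has already been downloaded (for example with wget). It deliberately uses only Python's standard-library html.parser as a stand-in, not Scrapy or BeautifulSoup. The names ArticleTextExtractor and extract_paragraphs are hypothetical, and the assumption that article text lives in <p> elements will not hold for every newspaper site.

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collect the text found inside <p> elements, ignoring <script>/<style>.

    Assumption (not from the original post): the article body is marked up
    with <p> tags, and everything else (menus, ads) can be discarded.
    """

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []   # finished, whitespace-normalised paragraphs
        self._buffer = []      # text pieces of the paragraph being read
        self._in_p = False     # True while inside a <p> element
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "p":
            self._flush()      # copes with a previous <p> that was never closed
            self._in_p = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and not self._skip_depth:
            self._buffer.append(data)

    def _flush(self):
        # Join the collected pieces and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []


def extract_paragraphs(html):
    """Return the paragraph texts of an HTML string as a list of str."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()            # keep a trailing paragraph that was not closed
    return parser.paragraphs
```

To apply it to a saved file, read the file into a string first and pass that to extract_paragraphs; the file's encoding has to be known or guessed, since newspaper sites do not all use UTF-8.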

For oral corpora of Latin American Spanish, I can recommend the Hamburg Corpus of Argentinean Spanish (HaCASpa): http://www.corpora.uni-hamburg.de/sfb538/en_h9_hacaspa.html

Kind regards,
Daniel

2012/11/29 Matías Guzmán <mortem.dei at gmail.com>
>
> Thanks for all your answers :)
>
> I'm interested in Spanish. I already have a corpus of about 20 newspapers from Spain, and now I would like to compile corpora for a couple of countries in America. My project (it's for my MA thesis) tries to predict, from the morphosyntactic and lexical features of a sentence, whether it is a pro-drop construction (so yes, I'll be using stats). I have reason to believe that the rate of pro-drop varies from country to country. I already have oral corpora for Spain and Colombia, but finding them for other countries has proven really difficult. I thought that newspaper corpora could be a nice way of getting documents for many different countries.
>
> I already tried wget. It seems to work quite well, but I wasn't able to clean the HTML files it creates using BeautifulSoup for Python. Maybe somebody knows of other software capable of doing this?
>
> Matías
--
*Daniel Stein*
Universität Hamburg
Hamburger Zentrum für Sprachkorpora <http://www.corpora.uni-hamburg.de/>
Max-Brauer-Allee 60
22765 Hamburg
Germany

Tel.: +49 (40) 42838-6425


