I'm interested in Spanish. I already have a corpus of about 20 newspapers from Spain, and now I would like to compile corpora for a couple of countries in the Americas. My project (it's for my MA thesis) tries to predict, from the morphosyntactic and lexical features of a sentence, whether it is a pro-drop construction (so yes, I'll be using statistics). I have reason to believe that the rate of pro-drop varies from country to country. I already have oral corpora for Spain and Colombia, but finding them for other countries has proven really difficult. I thought that newspaper corpora could be a good way of getting documents for many different countries.
I already tried wget, which seems to work quite well, but I wasn't able to clean the HTML files it creates using BeautifulSoup for Python. Maybe somebody knows of other software capable of doing this?
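
For reference, the cleaning step can also be sketched with Python's standard library alone, as an alternative to BeautifulSoup. This is only a minimal sketch: the names `TextExtractor` and `html_to_text` and the two tag sets are assumptions made for illustration, and it strips markup without trying to separate the article body from menus, captions or other page furniture.

```python
from html.parser import HTMLParser

# Tags whose contents are never article text (an assumption; extend as needed).
SKIP_TAGS = {"script", "style", "noscript"}
# Tags treated as block boundaries, so paragraphs end up on separate lines.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring everything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw_lines = "".join(parser.parts).splitlines()
    # Collapse runs of whitespace and drop lines that end up empty.
    lines = (" ".join(line.split()) for line in raw_lines)
    return "\n".join(line for line in lines if line)
```

Each file saved by wget would be read with the right encoding and passed to `html_to_text`; the result still needs a site-specific filter to keep only the article text.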
2012/11/29 Linda Bawcom <linda.bawcom at sbcglobal.net>
> Dear Matias,
> I'm afraid I can't help concerning your question, but I would like to
> comment that Mike Maxwell has made a very good point regarding cleaning up
> the articles. I had a very small corpus for my doctorate of just 73
> articles about the same topic taken only from two days of various
> newspapers. Because so many newspapers get their information from the same
> news services, I found a few articles that I had to discard because of a
> similarity ratio of over 80%, and of course that skews statistics. For such a
> small corpus, it was very easy to find the similarities using a plagiarism
> tool, http://plagiarism.bloomfieldmedia.com/z-wordpress/software/wcopyfind/ (if
> anyone is interested), but perhaps statistics don't enter into your project.
> Kindest regards,
> Linda Bawcom
> Houston Community College-Central
> *From:* Matías Guzmán <mortem.dei at gmail.com>
> *To:* "corpora at uib.no" <corpora at uib.no>
> *Sent:* Thu, November 29, 2012 12:29:16 PM
> *Subject:* [Corpora-List] Getting articles from newspapers to compile a
> Hi all,
> I was wondering if anyone knows how to get every possible article from
> online newspapers and magazines. I was thinking something like giving a
> program the URL of the newspaper (e.g. www.eltiempo.com) and getting the
> text from all pages therein. Is that possible?
> Thanks a lot,
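
The duplicate check Linda describes can be roughed out in Python as follows. This is a hedged sketch and not how WCopyfind works: it swaps in the word-level ratio from the standard library's `difflib.SequenceMatcher`, compares every pair of articles, and reports pairs above a threshold. The function names and the default threshold of 0.8 (taken from the "over 80%" figure in her message) are assumptions for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(text_a, text_b):
    """Word-level similarity in [0, 1], as computed by difflib.SequenceMatcher."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    # autojunk=False disables the heuristic that ignores very frequent items.
    return SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()


def near_duplicates(articles, threshold=0.8):
    """Return (name_a, name_b, ratio) for every pair scoring above threshold.

    `articles` maps an article name (hypothetical, e.g. a file name)
    to its plain text.
    """
    pairs = []
    for (name_a, text_a), (name_b, text_b) in combinations(articles.items(), 2):
        ratio = similarity(text_a, text_b)
        if ratio > threshold:
            pairs.append((name_a, name_b, ratio))
    return pairs
```

Comparing all pairs is quadratic in the number of articles, so this only suits small collections like the 73 articles mentioned above; a crawl of whole newspapers would need a cheaper method.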