I like JusText: http://code.google.com/p/justext/ (online demo: http://nlp.fi.muni.cz/projects/justext/)
I recently used it to clean about 2 billion words of web pages -- worked great.
BTW, if you're only downloading articles from 3-4 newspapers, you can usually figure out what HTML code for a particular newspaper is used to indicate the beginning and end of the "text". But for a more heterogeneous collection of texts, something like JusText is better.
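A minimal sketch of the site-specific approach, using only the Python standard library. It assumes each newspaper wraps its article text between two fixed strings; the markers and the sample page below are hypothetical, and every real site would need its own pair found by inspecting its HTML.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def extract_article(html, start_marker, end_marker):
    """Return the text between two site-specific markers, or None if absent."""
    start = html.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = html.find(end_marker, start)
    if end == -1:
        return None
    collector = _TextCollector()
    collector.feed(html[start:end])
    collector.close()
    return " ".join(collector.chunks)


# Hypothetical page and markers, for illustration only.
page = (
    '<html><body><div id="nav">Home | Sports</div>'
    '<div class="article-body"><p>First paragraph.</p>'
    '<p>Second &amp; last paragraph.</p></div>'
    '<!-- end article --><div id="footer">Copyright</div></body></html>'
)
text = extract_article(page, '<div class="article-body">', '<!-- end article -->')
print(text)  # First paragraph. Second & last paragraph.
```

Plain string markers break as soon as a site changes its templates, which is one reason this only scales to a handful of sources.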
============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/
** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================

From: corpora-bounces at uib.no [corpora-bounces at uib.no] on behalf of Matías Guzmán [mortem.dei at gmail.com]
Sent: Thursday, November 29, 2012 2:54 PM
To: Linda Bawcom
Cc: corpora at uib.no
Subject: Re: [Corpora-List] Getting articles from newspapers to compile a corpus
Thanks for all your answers :)
I'm interested in Spanish. I already have a corpus of about 20 newspapers from Spain, and now I would like to compile corpora for a couple of countries in America. My project (it's for my MA thesis) is trying to predict, from the morphosyntactic and lexical features of a sentence, whether it is a pro-drop construction (so yes, I'll be using stats). I have reason to believe that the rate of pro-drop varies from country to country. I already have oral corpora for Spain and Colombia, but finding them for other countries has proven really difficult. I thought that newspaper corpora could be a nice way of getting documents for many different countries.
I already tried wget; it seems to work quite well, but I wasn't able to clean the HTML files it creates using BeautifulSoup for Python. Does anybody know of other software capable of doing this?
2012/11/29 Linda Bawcom <linda.bawcom at sbcglobal.net>
I'm afraid I can't help concerning your question, but I would like to comment that Mike Maxwell has made a very good point regarding cleaning up the articles. I had a very small corpus for my doctorate of just 73 articles about the same topic, taken from only two days of various newspapers. Because so many newspapers get their information from the same news services, I found a few articles that I had to discard because of a similarity ratio of over 80%, and of course that skews statistics. For such a small corpus, it was very easy to find the similarities using a plagiarism tool, http://plagiarism.bloomfieldmedia.com/z-wordpress/software/wcopyfind/ (if anyone is interested), but perhaps statistics don't enter into your project.
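For larger collections, the same kind of check can be scripted. The sketch below is not WCopyfind's method; it substitutes a simple Jaccard overlap of word trigrams and flags pairs above a threshold. The function names, the 0.8 cutoff, and the sample texts are assumptions for illustration, and a Jaccard score is not directly comparable to the percentage a plagiarism tool reports.

```python
import re
from itertools import combinations


def shingles(text, n=3):
    """Return the set of word n-grams in a text, lowercased."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def similarity(a, b, n=3):
    """Jaccard overlap of word n-grams: 0.0 (disjoint) to 1.0 (identical)."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def near_duplicates(articles, threshold=0.8, n=3):
    """Given {name: text}, return the name pairs scoring above the threshold."""
    return [
        (x, y)
        for x, y in combinations(sorted(articles), 2)
        if similarity(articles[x], articles[y], n) > threshold
    ]


# Hypothetical articles: two papers reprinting the same wire story.
wire = "The city council approved the new budget on Tuesday after a long debate"
articles = {
    "paper_a": wire,
    "paper_b": wire + " yesterday",
    "paper_c": "Local football team wins the championship in extra time",
}
print(near_duplicates(articles))  # [('paper_a', 'paper_b')]
```

Comparing every pair is quadratic in the number of articles, which is fine for a few hundred texts but would need something like hashing-based candidate selection for a very large corpus.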
Linda Bawcom
Houston Community College-Central

From: Matías Guzmán <mortem.dei at gmail.com>
To: "corpora at uib.no" <corpora at uib.no>
Sent: Thu, November 29, 2012 12:29:16 PM
Subject: [Corpora-List] Getting articles from newspapers to compile a corpus
I was wondering if anyone knows how to get every possible article from online newspapers and magazines. I was thinking something like giving a program the URL of the newspaper (e.g. www.eltiempo.com) and getting the text from all pages therein. Is that possible?
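One building block of such a program is collecting the links on a page that stay within the same site, so that a crawler knows what to fetch next. The sketch below shows only that step, with the Python standard library; the class and function names and the sample HTML (including the article paths) are hypothetical, and the fetching loop, politeness delays, and robots.txt handling are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Record the href value of every <a> tag seen."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def same_site_links(html, page_url):
    """Return the sorted, de-duplicated http(s) links on the page's own host."""
    host = urlparse(page_url).netloc
    collector = LinkCollector()
    collector.feed(html)
    links = set()
    for href in collector.hrefs:
        # Resolve relative links and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == host:
            links.add(absolute)
    return sorted(links)


# Hypothetical front page; the paths are made up for illustration.
sample = (
    '<a href="/politica/nota-1">x</a>'
    '<a href="deportes/nota-2#comments">y</a>'
    '<a href="http://ads.example.com/banner">ad</a>'
    '<a href="mailto:info@example.com">mail</a>'
)
for link in same_site_links(sample, "http://www.eltiempo.com/"):
    print(link)
```

A full crawler would keep a queue of unvisited links and a set of visited ones, calling this function on each downloaded page; tools like wget's recursive mode already do that part.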
Thanks a lot,