[Corpora-List] Getting articles from newspapers to compile a corpus

Linda Bawcom linda.bawcom at sbcglobal.net
Thu Nov 29 22:28:40 CET 2012

Dear Matias,

I'm afraid I can't help with your question, but I would like to add that Mike Maxwell has made a very good point about cleaning up the articles. I had a very small corpus for my doctorate: just 73 articles on the same topic, taken from various newspapers over only two days. Because so many newspapers get their information from the same news services, I found a few articles that I had to discard because their similarity ratio was over 80%, and of course that skews the statistics. For such a small corpus, it was very easy to find the similarities using a plagiarism tool, http://plagiarism.bloomfieldmedia.com/z-wordpress/software/wcopyfind/ (if anyone is interested), but perhaps statistics don't enter into your project.
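For anyone who would rather script this kind of check, here is a minimal Python sketch of the idea. It is not how WCopyfind works: it simply compares word sequences with difflib.SequenceMatcher from the standard library. The 0.8 threshold mirrors the 80% figure above, and the function names and sample texts are made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(text_a, text_b):
    """Return a 0..1 similarity ratio computed over lowercase word tokens."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    # autojunk=False stops very frequent words from being ignored in long texts
    return SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()


def near_duplicates(articles, threshold=0.8):
    """Yield (name_a, name_b, ratio) for each pair of articles above the threshold."""
    for (name_a, text_a), (name_b, text_b) in combinations(articles.items(), 2):
        ratio = similarity(text_a, text_b)
        if ratio > threshold:
            yield name_a, name_b, ratio


# Hypothetical sample data: two versions of one wire story and one unrelated piece
articles = {
    "paper_a": "The mayor announced a new transit plan for the city on Monday morning",
    "paper_b": "The mayor announced a new transit plan for the city on Monday afternoon",
    "paper_c": "Local farmers reported a record harvest after an unusually wet spring",
}

flagged = list(near_duplicates(articles))
for name_a, name_b, ratio in flagged:
    print(f"{name_a} vs {name_b}: {ratio:.2f}")
```

Comparing every pair is quadratic in the number of articles, which is fine for a corpus of 73 texts (2,628 pairs) but would probably need a cheaper first pass for a large one.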

Kindest regards,

Linda Bawcom Houston Community College-Central

________________________________
From: Matías Guzmán <mortem.dei at gmail.com>
To: "corpora at uib.no" <corpora at uib.no>
Sent: Thu, November 29, 2012 12:29:16 PM
Subject: [Corpora-List] Getting articles from newspapers to compile a corpus

Hi all,

I was wondering if anyone knows how to get every possible article from online newspapers and magazines. I was thinking something like giving a program the URL of the newspaper (e.g. www.eltiempo.com) and getting the text from all pages therein. Is that possible?

Thanks a lot,

Matías