[Corpora-List] Getting articles from newspapers to compile a corpus

Toddy Mladenov me at toddysm.com
Thu Nov 29 19:57:41 CET 2012


If you use NLTK, there is a special module that lets you grab the HTML from a URL, strip out all the tags, and keep only the text.
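
For illustration, here is a minimal sketch of that fetch-and-strip step. The post does not name the NLTK module, so this sketch uses only the Python standard library instead. The names `html_to_text` and `fetch_text` are hypothetical, and the UTF-8 default encoding is an assumption that may not hold for every site.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the runs of whitespace left behind by removed markup.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Hypothetical helper: return the visible text of an HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def fetch_text(url, encoding="utf-8"):
    """Hypothetical helper: download one page and return its text.

    The encoding is an assumption; undecodable bytes are replaced.
    """
    with urlopen(url) as response:
        html = response.read().decode(encoding, errors="replace")
    return html_to_text(html)
```

Calling `fetch_text` on an article URL would return the text of that single page. It does not follow links, so collecting every article on a site would still need a separate crawling step.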

Is this what you are looking for?

On Nov 29, 2012 10:21 AM, "Matías Guzmán" <mortem.dei at gmail.com> wrote:


> Hi all,
>
> I was wondering if anyone knows how to get every possible article from
> online newspapers and magazines. I was thinking something like giving a
> program the URL of the newspaper (e.g. www.eltiempo.com) and getting the
> text from all pages therein. Is that possible?
>
> Thanks a lot,
>
> Matías