[Corpora-List] Getting articles from newspapers to compile a corpus

Khalid CHOUKRI choukri at elda.org
Thu Nov 29 22:16:13 CET 2012


Hi Matías,

Which languages and domains are you looking for, and what sizes? Are you looking for monolingual data? ELRA regularly collects such data (after negotiating the rights), so we may have something to share with you.

Best regards,
Khalid

Matías Guzmán wrote, On 29/11/2012 19:21:
> Hi all,
>
> I was wondering if anyone knows how to get every possible article from
> online newspapers and magazines. I was thinking something like giving a
> program the URL of the newspaper (e.g. www.eltiempo.com) and getting the
> text from all pages therein. Is that possible?
>
> Thanks a lot,
>
> Matías

--
Khalid Choukri
ELRA General Secretary & ELDA CEO
Email: choukri at elda.org; Web: www.elra.info www.elda.org
Tel. +33 1 43 13 33 33 - Fax. +33 1 43 13 33 30

Info on LREC 2012: www.lrec-conf.org


