[Corpora-List] looking for dutch corpus

Khalid Choukri choukri at elda.org
Thu Feb 5 16:53:52 CET 2015


Hi Marco

Some smaller corpora are available from ELRA free of charge, for example:

http://catalog.elra.info/product_info.php?products_id=764

Dutch - Het Financieele Dagblad - 1992-1993 (Samples)

The corpus contains articles from the Dutch financial newspaper Het
Financieele Dagblad editions of 2nd January 1992 through to 24th
December 1993. It contains around 8.5 million words of text.

Best regards,
Khalid

On 05/02/2015 14:09, Marco Baroni wrote:
> Dear All,
>
> I am looking for a Dutch corpus with the following characteristics:
>
> - I can download it (for free or for a fee) and process it with my own
> tools (as opposed to having just online access); [I will not
> redistribute it, I will acknowledge the source in any published work
> based on it, and I will not use it for commercial purposes, so most
> licensing schemes should be viable]
>
> - large: ideally billions of words, minimally hundreds of millions of
> tokens;
>
> - not too much work to convert it to plain text (e.g., I realize that
> I could create a corpus with the desired characteristics from the
> Dutch Wikipedia, but if somebody has already done it, I'd be happy to
> avoid re-doing the pre-processing myself).
>
> If anybody has such a corpus, or can link/put me in touch with someone
> who does, I'll be very grateful.
>
> Best,
>
> Marco

--

Khalid CHOUKRI
ELRA General Secretary & ELDA CEO
email: choukri at elda.org ; Web: www.elra.info www.elda.org
Tel. +33 1 43 13 33 33 - Fax. +33 1 43 13 33 30
Info on LREC: www.lrec-conf.org



