[Corpora-List] looking for Dutch corpus

Martin Reynaert reynaert at uvt.nl
Thu Feb 5 14:38:28 CET 2015


Dear Marco,

The 540M word token Dutch corpus SoNaR (contemporary written Dutch) is available for free from

http://tst-centrale.org/producten/corpora/sonar-corpus/6-85 .

It is in FoLiA XML format, for which we at Tilburg and Nijmegen universities have built, and are still building, a wide range of tools, e.g. for conversion to plain text.
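
For readers who only need a rough idea of what such a conversion involves, here is a minimal Python sketch using the standard library alone. It is not one of the tools mentioned above. It assumes the FoLiA namespace is http://ilk.uvt.nl/folia, that sentences are <s> elements containing <w> word tokens, and that each token's text sits in a direct <t> child; the function name folia_to_sentences is made up for this illustration. Check these assumptions against the actual SoNaR files before relying on it.

```python
import xml.etree.ElementTree as ET

# Assumed FoLiA namespace; verify it against the root element of your files.
FOLIA_NS = "http://ilk.uvt.nl/folia"


def folia_to_sentences(xml_source):
    """Return one space-joined string of word tokens per <s> element.

    Hypothetical helper: takes the first direct <t> child of every <w>
    and ignores any other annotation layers.
    """
    root = ET.fromstring(xml_source)
    sentences = []
    for s in root.iter(f"{{{FOLIA_NS}}}s"):
        words = []
        for w in s.iter(f"{{{FOLIA_NS}}}w"):
            t = w.find(f"{{{FOLIA_NS}}}t")
            if t is not None and t.text:
                words.append(t.text.strip())
        if words:
            sentences.append(" ".join(words))
    return sentences
```

Calling folia_to_sentences on the contents of one document yields a list of tokenised sentences that can be written out one per line. For a corpus of this size, a streaming parser such as xml.etree.ElementTree.iterparse would likely be a better fit than loading each file whole.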

Enjoy!

Martin Reynaert

On 05/02/15 14:09, Marco Baroni wrote:
> Dear All,
>
> I am looking for a Dutch corpus with the following characteristics:
>
> - I can download it (for free or for a fee) and process it with my own
> tools (as opposed to having just online access); [I will not
> redistribute it, I will acknowledge the source in any published work
> based on it, and I will not use it for commercial purposes, so most
> licensing schemes should be viable]
>
> - large: ideally billions of words, minimally hundreds of millions of
> tokens;
>
> - not too much work to convert it to plain text (e.g., I realize that
> I could create a corpus with the desired characteristics from the
> Dutch Wikipedia, but if somebody has already done it, I'd be happy to
> avoid re-doing the pre-processing myself).
>
> If anybody has such a corpus, or can link/put me in touch with someone
> who does, I'll be very grateful.
>
> Best,
>
> Marco
>
>
>
>