[Corpora-List] looking for dutch corpus

Martin Reynaert reynaert at uvt.nl
Thu Feb 5 14:38:28 CET 2015

Dear Marco,

The 540M word token Dutch corpus SoNaR (contemporary written Dutch) is available for free from

http://tst-centrale.org/producten/corpora/sonar-corpus/6-85 .

It is in FoLiA XML format, for which we at Tilburg and Nijmegen universities have built, and are still building, a wide range of tools, e.g. for conversion to plain text.
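
For a rough idea of what the plain-text conversion involves, here is a minimal sketch using only the Python standard library. It is not one of the tools mentioned above. It assumes tokenised FoLiA in which each word is a <w> element with a direct <t> child holding its text, grouped in <s> sentence elements, and it assumes the namespace http://ilk.uvt.nl/folia; please check both against the actual SoNaR files. The function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed FoLiA namespace; verify it against the root element of your files.
FOLIA_NS = "{http://ilk.uvt.nl/folia}"


def folia_to_sentences(xml_source):
    """Yield one space-joined string of tokens per <s> element.

    xml_source may be a file path or an open (binary) file object.
    """
    tree = ET.parse(xml_source)
    for sentence in tree.iter(FOLIA_NS + "s"):
        tokens = []
        for word in sentence.iter(FOLIA_NS + "w"):
            # Assumption: the token text sits in a direct <t> child of <w>.
            text = word.find(FOLIA_NS + "t")
            if text is not None and text.text:
                tokens.append(text.text.strip())
        if tokens:
            yield " ".join(tokens)
```

Calling `"\n".join(folia_to_sentences(path))` on a single document then gives one sentence per line. For anything beyond a quick look, the dedicated FoLiA tools are likely the more robust route, since real documents can carry several text layers per word.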


Martin Reynaert

On 05/02/15 14:09, Marco Baroni wrote:
> Dear All,
> I am looking for a Dutch corpus with the following characteristics:
> - I can download it (for free or for a fee) and process it with my own
> tools (as opposed to having just online access); [I will not
> redistribute it, I will acknowledge the source in any published work
> based on it, and I will not use it for commercial purposes, so most
> licensing schemes should be viable]
> - large: ideally billions of words, minimally hundreds of millions of
> tokens;
> - not too much work to convert it to plain text (e.g., I realize that
> I could create a corpus with the desired characteristics from the
> Dutch Wikipedia, but if somebody has already done it, I'd be happy to
> avoid re-doing the pre-processing myself).
> If anybody has such a corpus, or can link/put me in touch with someone
> who does, I'll be very grateful.
> Best,
> Marco
