[Corpora-List] looking for dutch corpus

Uwe Quasthoff quasthoff at informatik.uni-leipzig.de
Thu Feb 5 16:33:30 CET 2015

Dear Marco,

May I draw your attention to the Leipzig Corpora Collection at http://corpora.informatik.uni-leipzig.de/ ? There are corpora in more than 200 languages, including Dutch. The corpora are sentence-separated and sentence-scrambled. Each sentence comes with a source URL. Additional data, such as word frequencies and word co-occurrences, are included. Smaller corpora (up to 1 million sentences) can be downloaded directly as plain text or MySQL database files at http://corpora.informatik.uni-leipzig.de/download.html . Larger corpora are available on request for non-academic purposes without any charge.
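As a rough illustration of processing the plain-text download with your own tools, here is a minimal Python sketch that counts word frequencies. It assumes each line of a sentence file has the form "id<TAB>sentence"; that layout, the function name word_frequencies, and the sample sentences are assumptions made for this example, so please check the actual files before relying on it.

```python
from collections import Counter
from typing import Iterable


def word_frequencies(lines: Iterable[str]) -> Counter:
    """Count lowercased, whitespace-separated tokens in 'id<TAB>sentence' lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumed layout: an id, a tab, then the sentence text.
        # Lines without a tab yield an empty sentence and add nothing.
        _, _, sentence = line.partition("\t")
        counts.update(sentence.lower().split())
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines standing in for a downloaded sentence file.
    sample = [
        "1\tDit is een zin.\n",
        "2\tDit is nog een zin.\n",
    ]
    for word, n in word_frequencies(sample).most_common(3):
        print(word, n)
```

For a real download, an open file handle can be passed instead of the sample list, for example `with open(path, encoding="utf-8") as f: word_frequencies(f)`, where path points to the sentence file. Splitting on whitespace leaves punctuation attached to words, so a proper tokenizer would be needed for anything beyond a quick look.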

Our largest Dutch corpus consists of more than 70 million sentences with about 1.1 billion running words. Please have a look at our Dutch data and feel free to contact me.


Uwe Quasthoff

NLP Group Dept. of Computer Science Leipzig University Germany

On 05.02.2015 at 14:09, Marco Baroni wrote:
> Dear All,
> I am looking for a Dutch corpus with the following characteristics:
> - I can download it (for free or for a fee) and process it with my own
> tools (as opposed to having just online access); [I will not
> redistribute it, I will acknowledge the source in any published work
> based on it, and I will not use it for commercial purposes, so most
> licensing schemes should be viable]
> - large: ideally billions of words, minimally hundreds of millions of
> tokens;
> - not too much work to convert it to plain text (e.g., I realize that
> I could create a corpus with the desired characteristics from the
> Dutch Wikipedia, but if somebody has already done it, I'd be happy to
> avoid re-doing the pre-processing myself).
> If anybody has such a corpus, or can link/put me in touch with someone
> who does, I'll be very grateful.
> Best,
> Marco
