[Corpora-List] looking for dutch corpus

Maarten van Gompel proycon at anaproy.nl
Thu Feb 5 14:42:42 CET 2015


Quoting Marco Baroni (2015-02-05 14:09:13)
> Dear All,
>
> I am looking for a Dutch corpus with the following characteristics:
>
> - I can download it (for free or for a fee) and process it with my own
> tools (as opposed to having just online access); [I will not
> redistribute it, I will acknowledge the source in any published work
> based on it, and I will not use it for commercial purposes, so most
> licensing schemes should be viable]
>
> - large: ideally billions of words, minimally hundreds of millions of
> tokens;
>
> - not too much work to convert it to plain text (e.g., I realize that I
> could create a corpus with the desired characteristics from the Dutch
> Wikipedia, but if somebody has already done it, I'd be happy to avoid
> re-doing the pre-processing myself).
>
> If anybody has such a corpus, or can link/put me in touch with someone
> who does, I'll be very grateful.

Hi Marco,

One of the biggest Dutch corpora is the SoNaR-500 corpus, which contains more than 500 million words of Dutch and Flemish text from a wide variety of domains. The corpus is tokenised, PoS-tagged, lemmatised and annotated with named entities.

It is delivered in the FoLiA XML format (http://proycon.github.io/folia/); simple tools are available for conversion to plain text.
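
To give an idea of what such a conversion involves, here is a minimal Python sketch using only the standard library. It assumes the FoLiA namespace http://ilk.uvt.nl/folia and the usual layout of sentences (<s>) containing words (<w>), each with a text-content child (<t>). The sample document and the function name folia_sentences are made up for illustration; real SoNaR files carry many more annotation layers, so the dedicated FoLiA tools are probably the safer route for the full corpus.

```python
import xml.etree.ElementTree as ET

# Assumed FoLiA namespace; check it against the files you actually receive.
FOLIA_NS = "{http://ilk.uvt.nl/folia}"


def folia_sentences(xml_text):
    """Return one string per <s> element, with tokens joined by spaces."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter(FOLIA_NS + "s"):
        tokens = []
        for w in s.iter(FOLIA_NS + "w"):
            # Only look at the <t> that is a direct child of the word.
            t = w.find(FOLIA_NS + "t")
            if t is not None and t.text:
                tokens.append(t.text.strip())
        if tokens:
            sentences.append(" ".join(tokens))
    return sentences


# Hypothetical miniature document, not taken from SoNaR.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example">
  <text xml:id="example.text">
    <p xml:id="example.p.1">
      <s xml:id="example.p.1.s.1">
        <w xml:id="example.p.1.s.1.w.1"><t>Dit</t></w>
        <w xml:id="example.p.1.s.1.w.2"><t>is</t></w>
        <w xml:id="example.p.1.s.1.w.3"><t>een</t></w>
        <w xml:id="example.p.1.s.1.w.4"><t>zin</t></w>
        <w xml:id="example.p.1.s.1.w.5"><t>.</t></w>
      </s>
      <s xml:id="example.p.1.s.2">
        <w xml:id="example.p.1.s.2.w.1"><t>Nog</t></w>
        <w xml:id="example.p.1.s.2.w.2"><t>een</t></w>
        <w xml:id="example.p.1.s.2.w.3"><t>.</t></w>
      </s>
    </p>
  </text>
</FoLiA>"""

if __name__ == "__main__":
    for line in folia_sentences(SAMPLE):
        print(line)
```

This writes one tokenised sentence per line, which is often a convenient plain-text layout for further processing. For files too large to load at once, a streaming parser such as ET.iterparse would be a better fit than ET.fromstring.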

I believe the official way to obtain it, free of charge for non-commercial use, is through the TST-Centrale, who should be able to offer a full download: http://tst-centrale.org/en/producten/corpora/sonar-corpus/6-85

Another good large corpus, of web-crawled data, is "Corpora from the Web" (COW), delivered in a simple ad-hoc XML format: http://corporafromtheweb.org/nlcow14/

Regards,

--

Maarten van Gompel

Centre for Language Studies

Radboud Universiteit Nijmegen

proycon at anaproy.nl http://proycon.anaproy.nl http://github.com/proycon

GnuPG key: 0x1A31555C XMPP: proycon at anaproy.nl Bitcoin: 1BRptZsKQtqRGSZ5qKbX2azbfiygHxJPsd


