[Corpora-List] Dutch corpus

Miltenburg, C.W.J. van emiel.van.miltenburg at vu.nl
Fri Feb 6 11:36:40 CET 2015


Hi Marco,

See http://corporafromtheweb.org/ for a Dutch corpus of about 4.7bn words. I wrote a small script to read in the corpus: https://github.com/evanmiltenburg/cowparser.

Best wishes,

Emiel van Miltenburg, PhD student | Faculty of Humanities, VU University Amsterdam, De Boelelaan 1105, 13A-77 | emiel.van.miltenburg at vu.nl

On 06 Feb 2015, at 08:54, corpora-request at uib.no wrote:

Message: 3
Date: Thu, 5 Feb 2015 14:09:13 +0100
From: Marco Baroni <marco.baroni at unitn.it>
Subject: [Corpora-List] looking for dutch corpus
To: <CORPORA at UIB.NO>

Dear All,

I am looking for a Dutch corpus with the following characteristics:

- I can download it (for free or for a fee) and process it with my own tools (as opposed to having just online access); [I will not redistribute it, I will acknowledge the source in any published work based on it, and I will not use it for commercial purposes, so most licensing schemes should be viable]

- large: ideally billions of words, minimally hundreds of millions of tokens;

- not too much work to convert it to plain text (e.g., I realize that I could create a corpus with the desired characteristics from the Dutch Wikipedia, but if somebody has already done it, I'd be happy to avoid re-doing the pre-processing myself).

If anybody has such a corpus, or can link/put me in touch with someone who does, I'll be very grateful.

Best,

Marco



