See http://corporafromtheweb.org/ for a Dutch corpus of about 4.7bn words. I wrote a small script to read in the corpus: https://github.com/evanmiltenburg/cowparser.
Emiel van Miltenburg, PhD student | Faculty of Humanities, VU University Amsterdam, De Boelelaan 1105, 13A-77 | emiel.van.miltenburg at vu.nl
On 06 Feb 2015, at 08:54, corpora-request at uib.no wrote:
Message: 3
Date: Thu, 5 Feb 2015 14:09:13 +0100
From: Marco Baroni <marco.baroni at unitn.it>
Subject: [Corpora-List] looking for dutch corpus
To: <CORPORA at UIB.NO>
I am looking for a Dutch corpus with the following characteristics:
- I can download it (for free or for a fee) and process it with my own tools (as opposed to having just online access); [I will not redistribute it, I will acknowledge the source in any published work based on it, and I will not use it for commercial purposes, so most licensing schemes should be viable]
- large: ideally billions of words, minimally hundreds of millions of tokens;
- not too much work to convert it to plain text (e.g., I realize that I could create a corpus with the desired characteristics from the Dutch Wikipedia, but if somebody has already done so, I'd be happy to avoid redoing the pre-processing myself).
If anybody has such a corpus, or can link/put me in touch with someone who does, I'll be very grateful.