I am looking for a Dutch corpus with the following characteristics:
- I can download it (for free or for a fee) and process it with my own tools, as opposed to having online access only [I will not redistribute it, I will acknowledge the source in any published work based on it, and I will not use it for commercial purposes, so most licensing schemes should be viable];
- large: ideally billions of tokens, minimally hundreds of millions;
- not too much work to convert to plain text (e.g., I realize that I could build a corpus with the desired characteristics from the Dutch Wikipedia, but if somebody has already done it, I'd be happy to avoid redoing the pre-processing myself).
If anybody has such a corpus, or can link/put me in touch with someone who does, I'll be very grateful.
--
Marco Baroni
Center for Mind/Brain Sciences (CIMeC)
University of Trento
http://clic.cimec.unitn.it/marco