[Corpora-List] looking for dutch corpus

Alessandro Raganato raganato at di.uniroma1.it
Thu Feb 5 14:37:50 CET 2015


You can check the Polyglot project. They offer processed Wikipedia dumps with tokenized text: https://sites.google.com/site/rmyeid/projects/polyglot#TOC-Download-Wikipedia-Text-Dumps

On Thu, Feb 5, 2015 at 2:09 PM, Marco Baroni <marco.baroni at unitn.it> wrote:


> Dear All,
>
> I am looking for a Dutch corpus with the following characteristics:
>
> - I can download it (for free or for a fee) and process it with my own
> tools (as opposed to having just online access); [I will not redistribute
> it, I will acknowledge the source in any published work based on it, and I
> will not use it for commercial purposes, so most licensing schemes should
> be viable]
>
> - large: ideally billions of words, minimally hundreds of millions of
> tokens;
>
> - not too much work to convert it to plain text (e.g., I realize that I
> could create a corpus with the desired characteristics from the Dutch
> Wikipedia, but if somebody has already done it, I'd be happy to avoid
> re-doing the pre-processing myself).
>
> If anybody has such a corpus, or can link/put me in touch with someone who
> does, I'll be very grateful.
>
> Best,
>
> Marco
>
>
>
>
> --
> Marco Baroni
> Center for Mind/Brain Sciences (CIMeC)
> University of Trento
> http://clic.cimec.unitn.it/marco