I'd like to use the Spanish portion of the Wikipedia corpora that were recently announced on this list (see below). Has anybody processed it with a standard NLP pipeline (tokenization, lemmatization, POS tagging would be enough for my purposes) and is willing to share the processed version? It'd save me quite some time.
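For concreteness, this is roughly the processing I mean — a minimal sketch, assuming a tool like spaCy with a Spanish model (es_core_news_sm) installed, which is just my own illustration and not tied to the corpus release:

    # pip install spacy && python -m spacy download es_core_news_sm
    import spacy

    nlp = spacy.load("es_core_news_sm")

    def process(text):
        """Tokenize, lemmatize and POS-tag a span of Spanish text."""
        doc = nlp(text)
        # One (token, lemma, POS tag) triple per token.
        return [(tok.text, tok.lemma_, tok.pos_) for tok in doc]

    print(process("La Wikipedia en español contiene mucho texto."))

Any comparable output (CoNLL-style columns, FreeLing output, etc.) would work just as well for my purposes.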
> 1. Wikipedia Monolingual Corpora: more than 5 billion tokens of text in 23
> languages extracted from Wikipedia. The corpora are annotated with
> article and paragraph boundaries, number of incoming links for each
> article, anchor texts used to refer to each article (textlinks) and their
> frequencies, cross-language links, categories, and more. There
> is also a script that lets you extract domain-specific sub-corpora if you
> provide a list of desired categories.
--
Gemma Boleda
Universitat Pompeu Fabra
http://gboleda.utcompling.com