[Corpora-List] Has anybody processed Linguatools' Spanish Wikipedia corpus?

Gemma Boleda gemma.boleda at upf.edu
Thu Dec 18 23:28:04 CET 2014


Dear colleagues,

I'd like to use the Spanish portion of the Wikipedia corpora that were recently announced on this list (see below). Has anybody processed it with a standard NLP pipeline (tokenization, lemmatization, POS tagging would be enough for my purposes) and is willing to share the processed version? It'd save me quite some time.
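To be concrete, the kind of processing I mean is roughly the following. This is only a minimal sketch for illustration (the choice of spaCy, the "es_core_news_sm" model name, and the assumption that the article text has already been extracted to plain text are mine, not part of the corpus distribution):

    import spacy

    # Assumes: pip install spacy && python -m spacy download es_core_news_sm
    # (a small Spanish model; any Spanish pipeline with a lemmatizer would do)
    nlp = spacy.load("es_core_news_sm")

    def process(text):
        """Yield (token, lemma, POS tag) triples for one article's plain text."""
        for tok in nlp(text):
            if not tok.is_space:
                yield tok.text, tok.lemma_, tok.pos_

    for token, lemma, pos in process("La Wikipedia es una enciclopedia libre."):
        print(token, lemma, pos)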

Thank you,

Gemma Boleda.


> 1. Wikipedia Monolingual Corpora: more than 5 billion tokens of text in 23
> languages extracted from Wikipedia. The corpora are annotated with
> article and paragraph boundaries, number of incoming links for each
> article, anchor texts used to refer to each article (textlinks) and their
> frequencies, crosslanguage links, categories and more
> (http://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/).
> There is also a script that allows you to extract domain-specific
> sub-corpora if you provide a list of desired categories.
>
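(For what it's worth, even if the provided extraction script didn't cover a given use case, I'd expect a category filter along these lines to be straightforward. This is only a sketch, though: the element and attribute names below ("article", "category", "title") are guesses of mine, not the documented corpus format.)

    import xml.etree.ElementTree as ET

    WANTED = {"Lingüística", "Idioma español"}  # example list of desired categories

    def extract_subcorpus(path):
        """Stream the corpus file and yield articles matching WANTED categories."""
        for _, elem in ET.iterparse(path, events=("end",)):
            if elem.tag == "article":  # assumed tag name
                cats = {c.text for c in elem.findall("category")}  # assumed tag name
                if cats & WANTED:
                    yield elem.findtext("title"), "".join(elem.itertext())
                elem.clear()  # keep memory flat on a multi-gigabyte dump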

--
Gemma Boleda
Universitat Pompeu Fabra
http://gboleda.utcompling.com


