[Corpora-List] POS annotated corpora

Amir Zeldes Amir.Zeldes at georgetown.edu
Fri Jul 22 15:47:17 CEST 2016

Hi Tobias,

For English, GUM is a 100% manually POS tagged corpus, which also contains a lot of other annotations:


Also, if you’re interested in Coptic, we have a bunch of either fully manually annotated or manually corrected POS tagged corpora here:


All of these corpora are free to download and use under a Creative Commons license.

Hope this helps,


From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Sara Castagnoli
Sent: Friday, July 22, 2016 8:43 AM
To: Horsmann, Tobias
Cc: CORPORA at UIB.NO
Subject: Re: [Corpora-List] POS annotated corpora

Hi Tobias,

as regards Italian, you may want to have a look at the PAISA' corpus (http://www.corpusitaliano.it/en/contents/description.html), a collection of Italian web texts licensed under Creative Commons. An annotated version can be downloaded directly from the website.

We performed manual evaluation of POS annotation on a small sample, and found that it achieves 96.34% accuracy.

Hope this helps,


On 20 July 2016 at 14:24, Horsmann, Tobias <tobias.horsmann at uni-due.de> wrote:

Hi everyone,

I am looking for part-of-speech annotated corpora in any language.

Preferably hand-annotated or at least human-verified.

I would prefer corpora that are available for direct download without additional "sign a licence agreement" barriers.

Of course, I am only interested in material that is usable free of charge for research purposes, so no "Data Consortium" or other resellers.

So far I have found these:

Norwegian (http://www.nb.no/sprakbanken/show?serial=sbr-10)

Brazilian Portuguese Newswire (http://www.nltk.org/nltk_data/)

Dutch Alpino (https://www.let.rug.nl/vannoord/trees/)

Spanish (https://www.iula.upf.edu/recurs01_tbk_uk.htm)

Italian-TurinTree/Parallel (http://www.di.unito.it/~tutreeb/treebanks.html)

Polish National Corpus (http://nkjp.pl/index.php?page=14&lang=1)

Icelandic-Historical Corpus (http://linguist.is/icelandic_treebank/Icelandic_Parsed_Historical_Corpus_(IcePaHC))

Icelandic (http://www.malfong.is/index.php?lang=en&pg=mim)

Slovene-English Parallel Corpus (http://nl.ijs.si/elan/)

Finnish Treebank (http://www.ling.helsinki.fi/kieliteknologia/tutkimus/treebank/)

German Tiger (http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html)

Is anyone aware of additional corpora that can be downloaded directly? I need an annotated file, not a web interface.

I would appreciate suggestions to extend my current list and would post my final list once I am done collecting.



