[Corpora-List] POS annotated corpora

Horsmann, Tobias tobias.horsmann at uni-due.de
Wed Jul 20 14:24:03 CEST 2016


Hi everyone,

I am looking for part-of-speech annotated corpora in any language.

Preferably hand-annotated or at least human-verified.

I would prefer corpora that are available for direct download without additional "sign a licence agreement" barriers.

Of course, I am only interested in material that is usable free of charge for research purposes, so no "Data Consortium" or other resellers.

So far I have found these:

Norwegian (http://www.nb.no/sprakbanken/show?serial=sbr-10)

Brazilian Portuguese Newswire (http://www.nltk.org/nltk_data/)

Dutch Alpino (https://www.let.rug.nl/vannoord/trees/)

Spanish (https://www.iula.upf.edu/recurs01_tbk_uk.htm)

Italian-TurinTree/Parallel (http://www.di.unito.it/~tutreeb/treebanks.html)

Polish National Corpus (http://nkjp.pl/index.php?page=14&lang=1)

Icelandic-Historical Corpus (http://linguist.is/icelandic_treebank/Icelandic_Parsed_Historical_Corpus_(IcePaHC))

Icelandic (http://www.malfong.is/index.php?lang=en&pg=mim)

Slovene-English Parallel Corpus (http://nl.ijs.si/elan/)

Finnish Treebank (http://www.ling.helsinki.fi/kieliteknologia/tutkimus/treebank/)

German Tiger (http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html)

Is anyone aware of additional corpora that can be downloaded directly? I need an annotated file, not a web interface.

I would appreciate suggestions to extend my current list, and I will post the final list once I am done collecting.

Best, Tobias


