I am looking for part-of-speech annotated corpora in any language.
Preferably hand-annotated or at least human-verified.
I would prefer corpora that are available for direct download without additional "sign a licence agreement" barriers.
Of course, I am only interested in material that can be used free of charge for research purposes, so no "Data Consortium" or other resellers.
So far I have found these:
Norwegian (http://www.nb.no/sprakbanken/show?serial=sbr-10)
Brazilian Portuguese Newswire (http://www.nltk.org/nltk_data/)
Dutch Alpino (https://www.let.rug.nl/vannoord/trees/)
Spanish (https://www.iula.upf.edu/recurs01_tbk_uk.htm)
Italian-TurinTree/Parallel (http://www.di.unito.it/~tutreeb/treebanks.html)
Polish National Corpus (http://nkjp.pl/index.php?page=14&lang=1)
Icelandic-Historical Corpus (http://linguist.is/icelandic_treebank/Icelandic_Parsed_Historical_Corpus_(IcePaHC))
Icelandic (http://www.malfong.is/index.php?lang=en&pg=mim)
Slovene-English Parallel Corpus (http://nl.ijs.si/elan/)
Finnish Treebank (http://www.ling.helsinki.fi/kieliteknologia/tutkimus/treebank/)
German Tiger (http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html)
Is anyone aware of additional corpora that can be downloaded directly? I need an annotated file, not a web interface.
I would appreciate suggestions to extend my current list, and I will post the final list once I am done collecting.
Best,
Tobias