For English, GUM is a 100% manually POS tagged corpus, which also contains a lot of other annotations:
https://corpling.uis.georgetown.edu/gum/
Also, if you’re interested in Coptic, we have a bunch of either fully manually annotated or manually corrected POS tagged corpora here:
All of these corpora are free to download and use under a Creative Commons license.
Hope this helps,
Amir
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Sara Castagnoli
Sent: Friday, July 22, 2016 8:43 AM
To: Horsmann, Tobias
Cc: CORPORA at UIB.NO
Subject: Re: [Corpora-List] POS annotated corpora
Hi Tobias,
as regards Italian, you may want to have a look at the PAISA' corpus (http://www.corpusitaliano.it/en/contents/description.html), a collection of Italian web texts licensed under Creative Commons. An annotated version can be downloaded directly from the website.
We performed manual evaluation of POS annotation on a small sample, and found that it achieves 96.34% accuracy.
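For anyone wanting to run a similar spot check on another corpus: a figure like this is usually token-level accuracy, i.e. the share of sampled tokens whose automatic tag matches the manually checked tag. The sketch below assumes that definition; the thread does not say exactly how the sample was scored, and the function name and toy tags are invented for illustration.

```python
def tagging_accuracy(gold, predicted):
    """Token-level accuracy: fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sample")
    # Count positions where the automatic tag agrees with the manual one.
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Invented toy sample (not real corpus data): one of six tokens is mistagged.
gold_tags = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"]
auto_tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "NOUN"]

accuracy = tagging_accuracy(gold_tags, auto_tags)
print(f"{accuracy:.2%}")  # prints 83.33%
```

On a small sample the resulting percentage is only a rough estimate, so it is worth reporting the sample size alongside it.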
Hope this helps,
Sara
On 20 July 2016 at 14:24, Horsmann, Tobias <tobias.horsmann at uni-due.de> wrote:
Hi everyone,
I am looking for part-of-speech annotated corpora in any language.
Preferably hand-annotated or at least human-verified.
I would prefer corpora that are available for direct download without additional "sign a licence agreement" barriers.
Of course, I am only interested in material that is usable free of charge for research purposes, so no "Data Consortium" or other resellers.
So far I have found these:
Norwegian (http://www.nb.no/sprakbanken/show?serial=sbr-10)
Brazilian Portuguese Newswire (http://www.nltk.org/nltk_data/)
Dutch Alpino (https://www.let.rug.nl/vannoord/trees/)
Spanish (https://www.iula.upf.edu/recurs01_tbk_uk.htm)
Italian-TurinTree/Parallel (http://www.di.unito.it/~tutreeb/treebanks.html)
Polish National Corpus (http://nkjp.pl/index.php?page=14&lang=1)
Icelandic-Historical Corpus (http://linguist.is/icelandic_treebank/Icelandic_Parsed_Historical_Corpus_(IcePaHC))
Icelandic (http://www.malfong.is/index.php?lang=en&pg=mim)
Slovene-English Parallel Corpus (http://nl.ijs.si/elan/)
Finnish Treebank (http://www.ling.helsinki.fi/kieliteknologia/tutkimus/treebank/)
German Tiger (http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html)
Is anyone aware of additional corpora that can be directly downloaded? I need an annotated file, not a web interface.
I would appreciate suggestions to extend my current list and would post my final list once I am done collecting.
Best,
Tobias