[Corpora-List] POS annotated corpora

Sara Castagnoli sara.castagnoli at gmail.com
Fri Jul 22 14:43:14 CEST 2016


Hi Tobias,

As regards Italian, you may want to have a look at the PAISA' corpus (http://www.corpusitaliano.it/en/contents/description.html), a collection of Italian web texts licensed under Creative Commons. An annotated version can be downloaded directly from the website.

We manually evaluated the POS annotation on a small sample and found it to be 96.34% accurate.
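
For anyone who wants to run a similar check on their own sample, token-level accuracy is simply the share of tokens whose automatic tag matches the manually assigned one. The snippet below is only a rough sketch of that calculation: the function name `pos_accuracy` and the toy tags are made up for illustration and are not our evaluation script or data from the corpus.

```python
def pos_accuracy(gold_tags, predicted_tags):
    """Return the share of tokens whose predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty sample")
    # Count positions where the automatic tag agrees with the manual one.
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


# Toy sample (invented tags, hypothetical data): one of five tokens is mistagged.
gold = ["DET", "NOUN", "VERB", "DET", "NOUN"]
pred = ["DET", "NOUN", "VERB", "DET", "ADJ"]
accuracy = pos_accuracy(gold, pred)
print(f"{accuracy:.2%}")  # prints 80.00%
```

On a small sample the resulting figure is only an estimate, so it is worth reporting the sample size alongside it.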

Hope this helps,
Sara

On 20 July 2016 at 14:24, Horsmann, Tobias <tobias.horsmann at uni-due.de> wrote:


> Hi everyone,
>
>
>
> I am looking for part-of-speech annotated corpora in any
> language.
>
> Preferably hand-annotated or at least human-verified.
>
> I would prefer corpora that are available for direct download
> without additional "sign a licence agreement" barriers.
>
> Of course, only material that is usable free of charge for
> research purposes, so no "Data Consortium" or other resellers.
>
>
>
> So far I have found these:
>
> Norwegian (http://www.nb.no/sprakbanken/show?serial=sbr-10)
>
> Brazilian Portuguese Newswire (http://www.nltk.org/nltk_data/)
>
> Dutch Alpino (https://www.let.rug.nl/vannoord/trees/)
>
> Spanish (https://www.iula.upf.edu/recurs01_tbk_uk.htm)
>
> Italian-TurinTree/Parallel (
> http://www.di.unito.it/~tutreeb/treebanks.html)
>
> Polish National Corpus (
> http://nkjp.pl/index.php?page=14&lang=1)
>
> Icelandic-Historical Corpus (
> http://linguist.is/icelandic_treebank/Icelandic_Parsed_Historical_Corpus_(IcePaHC)
> )
>
> Icelandic (http://www.malfong.is/index.php?lang=en&pg=mim)
>
> Slovene-English Parallel Corpus (http://nl.ijs.si/elan/)
>
> Finnish Treebank (
> http://www.ling.helsinki.fi/kieliteknologia/tutkimus/treebank/)
>
> German Tiger (
> http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html)
>
>
>
> Is anyone aware of additional corpora that can be directly
> downloaded? (I need an annotated file, no web interface.)
>
> I would appreciate suggestions to extend my current list and
> would post my final list once I am done collecting.
>
>
>
> Best,
>
> Tobias
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>