[Corpora-List] Tamil POS tagged Corpus

sobha L sobhanair at yahoo.com
Tue Sep 6 12:12:58 CEST 2016

We are pleased to announce the release of the AUKBC Tamil Part-of-Speech Corpus and Part-of-Speech Tagger Engine.

It was released on 24th May 2016 at the recently concluded 3rd Workshop on Indian Language Data: Resources and Evaluation (WILDRE 3), co-located with the 10th edition of the Language Resources and Evaluation Conference (LREC 2016) in Slovenia.

This Tamil corpus (515K tokens) is the largest manually annotated POS-tagged corpus available for Indian languages. The corpus text is the famous 20th-century Tamil novel "Ponniyin Selvan", written by Kalki Krishnamoorthy.

The corpus is annotated with the BIS Tagset, a hierarchical tagset approved by the Bureau of Indian Standards and the Tamil Virtual Academy.

Corpus statistics: total number of sentences - 50,876; number of words - 5,15,283

POS Tagger:

The POS Tagger engine is released under the GNU GPL version 3.0 license.

The corpus and the engine can be downloaded from http://au-kbc.org/nlp/corpusrelease.html

for CLRG team at AU-KBC
sobha

Dr. Sobha L (Lalitha Devi)
CLRG, AU-KBC Research Centre,
MIT, Anna University, Chennai
www.au-kbc.org/nlp/

On Tuesday, September 6, 2016 3:03 PM, Albert Gatt <albert.gatt at um.edu.mt> wrote:

Dear Tobias

the Maltese Language Resource Server hosts corpora of the Maltese language with POS annotations: mlrs.research.um.edu.mt

best,
albert

On 28 July 2016 at 17:10, Horsmann, Tobias <tobias.horsmann at uni-due.de> wrote:

Hi everyone,

I recently asked for suggestions for publicly available POS-annotated corpora. Thanks for the answers. As promised, here is my updated list. I am still looking for more POS-annotated corpora, so if you are aware of any others, please tell me :)

Norwegian (http://www.nb.no/sprakbanken/show?serial=sbr-10)
Brazilian Portuguese Newswire (http://www.nltk.org/nltk_data/)
Dutch Alpino (https://www.let.rug.nl/vannoord/trees/)
Spanish (https://www.iula.upf.edu/recurs01_tbk_uk.htm)
Italian-TurinTree/Parallel (http://www.di.unito.it/~tutreeb/treebanks.html)
Polish National Corpus (http://nkjp.pl/index.php?page=14&lang=1)
Icelandic-Historical Corpus (http://linguist.is/icelandic_treebank/Icelandic_Parsed_Historical_Corpus_(IcePaHC))
Icelandic (http://www.malfong.is/index.php?lang=en&pg=mim)
Slovene-English Parallel Corpus (http://nl.ijs.si/elan/)
Finnish Treebank (http://www.ling.helsinki.fi/kieliteknologia/tutkimus/treebank/)
German Tiger (http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html)

-- Newly added --
German Hamburg Treebank (https://corpora.uni-hamburg.de/drupal/en/islandora/object/treebank:hdt)
Russian Open Corpus (http://opencorpora.org/?page=downloads)
Multi Universal Dependencies (http://universaldependencies.org/)
Italian-Pisa (http://www.corpusitaliano.it/en/contents/description.html)
English (https://corpling.uis.georgetown.edu/gum/)
Coptic (https://github.com/CopticScriptorium/corpora)
French (https://deep-sequoia.inria.fr/corpus/)
French (https://perso.limsi.fr/pap/free_multitag.tgz)
Danish (https://code.google.com/p/copenhagen-dependency-treebank/)
Croatian (http://nlp.ffzg.hr/resources/corpora/setimes-hr/)
Swedish Talbanken (http://stp.lingfil.uu.se/%7Emojgan/UPDT.html)
English Ted Talk Treebank (http://ahclab.naist.jp/resource/tedtreebank/)

Best,
Tobias

_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora

--
Albert Gatt
Institute of Linguistics
University of Malta
http://staff.um.edu.mt/albert.gatt/

