[Corpora-List] Rule-based toolkit RDRPOSTagger for POS and morphological tagging

Dat Quoc Nguyen datquocnguyen at gmail.com
Wed May 18 15:39:16 CEST 2016


We would like to announce that we have just released pre-trained Universal POS tagging models for 40 languages in RDRPOSTagger version 1.2.2 (http://jldadmm.sourceforge.net/ - a 10MB .zip download).

These pre-trained models are learned from the training data of the Universal Dependencies project (version 1.3). The tagging accuracies (%) on the UD v1.3 test sets are as follows:

Ancient_Greek : 91.5686507465
Ancient_Greek-PROIEL : 95.7193816885
Arabic : 94.414521
Basque : 92.4263559531
Bulgarian : 96.1294012966
Catalan : 96.5174210621
Chinese : 89.4522144522
Croatian : 93.8666666667
Czech : 97.676954331
Czech-CAC : 97.8256880734
Czech-CLTT : 97.008027244
Danish : 93.4738273283
Dutch : 88.7557761424
Dutch-LassySmall : 94.3665059185
English : 92.7040165763
English-LinES : 94.399245372
Estonian : 93.8360794254
Finnish : 92.2428884026
Finnish-FTB : 90.9631537
French : 95.2288488211
Galician : 96.3053855981
German : 90.3972909234
Gothic : 93.854207057
Greek : 96.8595624559
Hebrew : 93.5171584991
Hindi : 95.0239909681
Hungarian : 88.6894923259
Indonesian : 90.7470288625
Irish : 90.6045537817
Italian : 96.4843416674
Kazakh : 79.2207792208
Latin : 90.3973509934
Latin-ITTB : 98.2437385461
Latin-PROIEL : 95.786931437
Latvian : 86.3488080301
Norwegian : 94.602783512
Old_Church_Slavonic : 94.6249261666
Persian : 95.9982628118
Polish : 94.0848990953
Portuguese : 95.0814436282
Portuguese-BR : 95.0879815205
Romanian : 94.5197278912
Russian-SynTagRus : 97.6535452073
Slovenian : 94.0268790443
Slovenian-SST : 91.1555404947
Spanish : 95.1279527559
Spanish-AnCora : 96.7886891714
Swedish : 94.3956421456
Swedish-LinES : 94.4701020904
Tamil : 82.0888685295
Turkish : 91.9262341167

Note that these accuracies are obtained with a weak initial tagger: the internal initial tagger in RDRPOSTagger is simply based on a lexicon extracted from the training set. Higher accuracies are likely with a stronger external initial tagger such as TnT.
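To illustrate what a lexicon-based initial tagger amounts to, here is a minimal Python sketch. It is not the RDRPOSTagger code: the function names, the toy training data and the fallback tag for unknown words are all assumptions made for the example, and the real internal tagger may treat unknown words differently.

```python
from collections import Counter, defaultdict


def build_lexicon(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def initial_tag(words, lexicon, default_tag="NOUN"):
    """Tag each word by lexicon lookup; unknown words get default_tag.

    default_tag is an assumption of this sketch, not a documented setting.
    """
    return [(word, lexicon.get(word, default_tag)) for word in words]


# Toy training set (hypothetical data, Universal POS tags).
train = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("she", "PRON"), ("runs", "VERB")],
    [("two", "NUM"), ("runs", "NOUN")],
]

lexicon = build_lexicon(train)
tagged = initial_tag(["the", "cat", "runs"], lexicon)
```

In this sketch "runs" is always tagged VERB because that is its most frequent tag, regardless of context. Such context-blind errors are the kind that the learned tagging rules, or a stronger initial tagger, would be expected to correct.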

For every language, the tagging speed is hundreds of thousands of word tokens per second for the Java implementation on a personal computer. The Python implementation is about 10 times slower.
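For readers who want to check the speed on their own machine, a throughput figure in tokens per second can be measured along the following lines. This is only a sketch: `tag_fn` stands for whatever tagging function is being timed, and `dummy_tagger` is a hypothetical stand-in, not part of RDRPOSTagger.

```python
import time


def tokens_per_second(tag_fn, sentences):
    """Return the number of word tokens processed per second by tag_fn."""
    n_tokens = sum(len(sentence) for sentence in sentences)
    start = time.perf_counter()
    for sentence in sentences:
        tag_fn(sentence)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very small inputs.
    return n_tokens / elapsed if elapsed > 0 else float("inf")


def dummy_tagger(words):
    """Hypothetical stand-in tagger: labels every token as NOUN."""
    return [(word, "NOUN") for word in words]


# 10,000 copies of a four-token sentence, i.e. 40,000 tokens in total.
corpus = [["this", "is", "a", "test"]] * 10000
speed = tokens_per_second(dummy_tagger, corpus)
print("%.0f tokens/second" % speed)
```

The result depends on the hardware and on whether model loading is included in the timing, so figures measured this way are only comparable when taken under the same conditions.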

Best regards, RDRPOSTagger development team

On Wed, Apr 6, 2016 at 11:56 PM, Dai Quoc Nguyen <nquocdai at gmail.com> wrote:


> (Apologies for cross-posting)
> ***********************************************************************
> We are pleased to announce the release of RDRPOSTagger (version 1.2.1).
>
> RDRPOSTagger is a robust, easy-to-use and language-independent toolkit for
> POS and morphological tagging. It employs an error-driven approach to
> automatically construct tagging rules in the form of a binary tree. The
> main properties of RDRPOSTagger are as follows:
>
>
> - RDRPOSTagger is fast in both learning and tagging. For example, it
> achieved tagging speeds of 5K and 90K English word tokens/second for
> single-threaded implementations in Python and Java respectively, on a
> computer with a 2.4GHz Core2Duo.
> - RDRPOSTagger achieves a very competitive accuracy in comparison to
> the state-of-the-art results. Please see experimental results including
> training time, tagging speed and tagging accuracy for 13 languages in the
> following paper:
>
> A Robust Transformation-Based Learning Approach Using Ripple Down Rules
> for Part-Of-Speech Tagging
> <http://content.iospress.com/articles/ai-communications/aic698>. *AI
> Communications*, to appear. [CameraReadyVersion.pdf]
> <http://arxiv.org/abs/1412.4021>
>
>
> - RDRPOSTagger supports pre-trained POS and morphological tagging
> models for 13 languages including Bulgarian, Czech, Dutch, English, French,
> German, Hindi, Italian, Portuguese, Spanish, Swedish, Thai and Vietnamese.
>
>
> Please find more information about RDRPOSTagger at its website:
> http://rdrpostagger.sourceforge.net
>
> Best regards,
> RDRPOSTagger development team
>