[Corpora-List] Searching an easy-to-train Lemmatizer and POS tagger for Kyrgyz (Jörg Knappen)

Toms Bergmanis tomsbergmanis at gmail.com
Thu Jan 31 16:55:27 CET 2019


Hi Michael,

Lemming is a well-performing non-neural lemmatizer/tagger. I used it as a strong baseline in my paper on Context Sensitive Neural Lemmatization with Lematus ( http://homepages.inf.ed.ac.uk/s1044253/papers/Context_Sensitive_Neural_Lemmatization_with_Lematus.pdf ). While it performs worse than Lematus on lemmatization, it jointly performs lemmatization and POS/morphological tagging. Also, its training time is fairly short.

If you are interested in Lemming's lemmatization accuracies across 20 different languages in various data settings, you can check out the result tables here: https://docs.google.com/spreadsheets/d/115mizFo9CYORI6MHshC-c3tRFq5cuBTPwkBaFY8kggU/edit

Let me know if you have any questions.

Toms Bergmanis

On Thu, 31 Jan 2019, 14:28 Koos Wilt <kooswilt at gmail.com> wrote:


> Try the Python NLTK tagger and lemmatizer. They are easily integrated
> with the other NLTK components, which constitute a great package.
>
> On Thu, 31 Jan 2019 at 15:09, Michael Ustaszewski <
> Michael.Ustaszewski at uibk.ac.at> wrote:
>
>> Re: Searching an easy-to-train Lemmatizer and POS tagger for Kyrgyz
>>
>> Dear Jörg,
>>
>> regarding your question about trainable POS taggers and lemmatizers for
>> the Kyrgyz language: OpenNLP provides training interfaces for each of
>> its modules (see
>> https://opennlp.apache.org/docs/1.9.1/manual/opennlp.html).
>> Alternatively, you may consider the IXA pipes
>> (http://ixa2.si.ehu.es/ixa-pipes/), which are based on OpenNLP and which
>> provide exactly what you are looking for: easily trainable,
>> language-independent tools, hence you can train your own models for
>> tokenisation, lemmatisation, POS-tagging, NERC, and so on. Of course,
>> you need suitable training corpora. Several input formats are supported
>> by the IXA pipes training module. However, I am not aware of any
>> training corpora for the Kyrgyz language, although I have seen in the
>> Universal Dependencies repository (https://universaldependencies.org/)
>> that Kyrgyz is one of the upcoming languages.
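
A note on what such training data can look like: as far as I recall, the OpenNLP POS tagger trainer reads plain text with one sentence per line and each token written as word_tag. The minimal Python sketch below writes that layout. The function name, the toy Kyrgyz sentence and its tags are my own illustrative assumptions, so check the OpenNLP manual linked above before relying on the format.

```python
def to_word_tag_line(tagged_sentence, sep="_"):
    """Render one sentence as a single 'word_tag word_tag ...' line.

    tagged_sentence is a list of (word, tag) pairs. The word_tag layout is
    an assumption about the trainer's input format, not a verified spec.
    """
    parts = []
    for word, tag in tagged_sentence:
        # Tokens are separated by spaces, so neither field may contain whitespace.
        if any(ch.isspace() for ch in word + tag):
            raise ValueError(f"whitespace inside token or tag: {word!r}/{tag!r}")
        parts.append(f"{word}{sep}{tag}")
    return " ".join(parts)


# Toy, hand-made example (hypothetical data): "Мен китеп окудум." = "I read a book."
toy_corpus = [
    [("Мен", "PRON"), ("китеп", "NOUN"), ("окудум", "VERB"), (".", "PUNCT")],
]

training_lines = [to_word_tag_line(sentence) for sentence in toy_corpus]
print("\n".join(training_lines))
```

Writing `training_lines` to a UTF-8 text file, one line per sentence, gives a file in this layout.
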
>>
>> As far as I know, UDPipe (http://ufal.mff.cuni.cz/udpipe) can be
>> trained with your own corpora. There is also an implementation of UDPipe
>> in R, which you may also use to train your own models (see
>>
>> https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-annotation.html
>> ).
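
UDPipe is trained on CoNLL-U files: one token per line, ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), and a blank line after each sentence. Below is a minimal Python sketch, assuming you only have form/lemma/POS triples and leave every other column as "_". The helper name and the toy sentence are hypothetical, and whether a given trainer accepts empty syntax columns should be checked in its documentation.

```python
def to_conllu(sentence):
    """Serialize one sentence of (form, lemma, upos) triples as CoNLL-U.

    Only ID, FORM, LEMMA and UPOS are filled; the remaining six columns
    are set to "_" (unspecified). This is a sketch for tagging and
    lemmatization data, not a full CoNLL-U writer.
    """
    rows = []
    for idx, (form, lemma, upos) in enumerate(sentence, start=1):
        columns = [str(idx), form, lemma, upos, "_", "_", "_", "_", "_", "_"]
        rows.append("\t".join(columns))
    # A blank line terminates each sentence block.
    return "\n".join(rows) + "\n\n"


# Toy, hand-made example (hypothetical data): "Мен китеп окудум." = "I read a book."
toy_sentence = [
    ("Мен", "мен", "PRON"),
    ("китеп", "китеп", "NOUN"),
    ("окудум", "оку", "VERB"),
    (".", ".", "PUNCT"),
]

conllu_text = to_conllu(toy_sentence)
print(conllu_text, end="")
```

Concatenating the output for all sentences and saving it as UTF-8 yields a file in this layout.
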
>>
>> Probably there are many more trainable NLP tools out there that might
>> meet your requirements; the ones mentioned above are those that I know
>> and that I found easy to use.
>>
>> Best wishes,
>> Michael
>>
>> On 31.01.2019 at 11:25, corpora-request at uib.no wrote:
>> > ------------------------------
>> >
>> > Message: 2
>> > Date: Wed, 30 Jan 2019 12:34:36 +0100
>> > From: Jörg Knappen <j.knappen at mx.uni-saarland.de>
>> > Subject: [Corpora-List] Searching an easy-to-train Lemmatizer and POS
>> > tagger for Kyrgyz
>> > To: corpora at uib.no
>> >
>> >
>> > I am searching for some tools usable for Lemmatising and POS-tagging
>> > Kyrgyz. Kyrgyz is a Turkic language (agglutinative) written with the
>> > Cyrillic alphabet. I don't expect pre-trained tools to be out there
>> > (if there is one, that would be great), but I hope to find something
>> > that can be trained easily (not needing too much training data).
>> >
>> > Thanks in advance,
>> >
>> > Jörg Knappen
>>
>>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> https://mailman.uib.no/listinfo/corpora
>>