[Corpora-List] Searching an easy-to-train Lemmatizer and POS tagger for Kyrgyz (Jörg Knappen)

Toms Bergmanis tomsbergmanis at gmail.com
Thu Jan 31 17:01:38 CET 2019


Sorry, the previous message was addressed to Jörg.

On Thu, 31 Jan 2019, 15:55 Toms Bergmanis <tomsbergmanis at gmail.com> wrote:


> Hi Michael,
> Lemming is a well-performing non-neural lemmatizer/tagger. I used it as
> a strong baseline in my paper on Context Sensitive Neural Lemmatization
> with Lematus (
> http://homepages.inf.ed.ac.uk/s1044253/papers/Context_Sensitive_Neural_Lemmatization_with_Lematus.pdf
> ).
> While it performs worse than Lematus on lemmatization, it jointly performs
> POS/morphological tagging. Also, its training time is fairly short.
> If you are interested in Lemming's lemmatization accuracies across 20
> different languages in various data settings you can check out the result
> tables here:
>
> https://docs.google.com/spreadsheets/d/115mizFo9CYORI6MHshC-c3tRFq5cuBTPwkBaFY8kggU/edit
> Let me know if you have any questions.
> Toms Bergmanis
>
> On Thu, 31 Jan 2019, 14:28 Koos Wilt <kooswilt at gmail.com> wrote:
>
>> Try the Python NLTK tagger and lemmatizer. They are easily integrated
>> with the rest of NLTK, which is a great package.
>>
>> On Thu, 31 Jan 2019 at 15:09, Michael Ustaszewski <
>> Michael.Ustaszewski at uibk.ac.at> wrote:
>>
>>> Re: Searching an easy-to-train Lemmatizer and POS tagger for Kyrgyz
>>>
>>> Dear Jörg,
>>>
>>> regarding your question about trainable POS taggers and lemmatizers for
>>> the Kyrgyz language: OpenNLP provides training interfaces for each of
>>> its modules (see
>>> https://opennlp.apache.org/docs/1.9.1/manual/opennlp.html).
>>> Alternatively, you may consider the IXA pipes
>>> (http://ixa2.si.ehu.es/ixa-pipes/), which are based on OpenNLP and
>>> provide exactly what you are looking for: easily trainable,
>>> language-independent tools, so you can train your own models for
>>> tokenisation, lemmatisation, POS tagging, NERC, and so on. Of course,
>>> you need suitable training corpora. Several input formats are supported
>>> by the IXA pipes training module. However, I am not aware of any
>>> training corpora for the Kyrgyz language. In the Universal Dependencies
>>> repository (https://universaldependencies.org/), I have seen that Kyrgyz
>>> is listed as one of the upcoming languages.
>>>
>>> As far as I know, UDPipe (http://ufal.mff.cuni.cz/udpipe) can be
>>> trained on your own corpora. There is also an R implementation of
>>> UDPipe, which you can likewise use to train your own models (see
>>>
>>> https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-annotation.html
>>> ).
>>>
>>> There are probably many more trainable NLP tools out there that might
>>> meet your requirements; the ones mentioned above are those that I know
>>> and have found easy to use.
>>>
>>> Best wishes,
>>> Michael
>>>
>>> On 31.01.2019 at 11:25, corpora-request at uib.no wrote:
>>> > ------------------------------
>>> >
>>> > Message: 2
>>> > Date: Wed, 30 Jan 2019 12:34:36 +0100
>>> > From: Jörg Knappen <j.knappen at mx.uni-saarland.de>
>>> > Subject: [Corpora-List] Searching an easy-to-train Lemmatizer and POS
>>> > tagger for Kyrgyz
>>> > To: corpora at uib.no
>>> >
>>> >
>>> > I am searching for tools usable for lemmatising and POS-tagging
>>> > Kyrgyz. Kyrgyz is a Turkic language (agglutinative) written with the
>>> > Cyrillic alphabet. I don't expect pre-trained tools to be out there
>>> > (if there is one, it would be great), but I hope to find something
>>> > that can be trained easily (not needing too much training data).
>>> >
>>> > Thanks in advance,
>>> >
>>> > Jörg Knappen
>>>
>>>
>>> _______________________________________________
>>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>> Corpora mailing list
>>> Corpora at uib.no
>>> https://mailman.uib.no/listinfo/corpora
>>>
>
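
A note for readers following the thread: UDPipe is trained on CoNLL-U
files, the format used by the Universal Dependencies repository
mentioned above. Before training any of the suggested tools, a
most-frequent-analysis lookup can serve as a sanity-check baseline. The
following is only a minimal sketch under stated assumptions: the
function names (read_conllu, train_baseline, analyse) and the two-token
sample are invented for illustration and come from none of the tools or
treebanks named in this thread, and the fallback for unseen words
(lower-cased form, tag "X") is an arbitrary choice.

```python
from collections import Counter, defaultdict


def read_conllu(text):
    """Yield (form, lemma, upos) for each ordinary token line of CoNLL-U text."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines separate sentences; '#' lines are comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # a CoNLL-U token line has exactly 10 tab-separated fields
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        yield cols[1], cols[2], cols[3]


def train_baseline(triples):
    """Map each lower-cased form to its most frequent (lemma, upos) pair."""
    counts = defaultdict(Counter)
    for form, lemma, upos in triples:
        counts[form.lower()][(lemma, upos)] += 1
    return {form: c.most_common(1)[0][0] for form, c in counts.items()}


def analyse(model, form):
    """Look up a form; unseen forms fall back to (lower-cased form, 'X')."""
    return model.get(form.lower(), (form.lower(), "X"))


# Hypothetical hand-made sample, NOT taken from any real treebank.
SAMPLE = "\n".join([
    "# text = китептер",
    "\t".join(["1", "китептер", "китеп", "NOUN", "_", "_", "0", "root", "_", "_"]),
    "",
    "# text = китеп",
    "\t".join(["1", "китеп", "китеп", "NOUN", "_", "_", "0", "root", "_", "_"]),
    "",
])

model = train_baseline(read_conllu(SAMPLE))
print(analyse(model, "Китептер"))
```

Such a lookup cannot generalise to unseen word forms, which matters a
lot for an agglutinative language, so it is only a lower bound against
which a trained tagger/lemmatizer can be compared.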