[Corpora-List] Searching an easy-to-train Lemmatizer and POS tagger for Kyrgyz

Michael Ustaszewski Michael.Ustaszewski at uibk.ac.at
Thu Jan 31 15:19:50 CET 2019


Dear Jörg,

regarding your question about trainable POS taggers and lemmatizers for the Kyrgyz language: OpenNLP provides training interfaces for each of its modules (see https://opennlp.apache.org/docs/1.9.1/manual/opennlp.html). Alternatively, you may consider the IXA pipes (http://ixa2.si.ehu.es/ixa-pipes/), which are based on OpenNLP and provide exactly what you are looking for: easily trainable, language-independent tools with which you can train your own models for tokenisation, lemmatisation, POS tagging, NERC, and so on. Of course, you need suitable training corpora; the IXA pipes training module supports several input formats. However, I am not aware of any training corpora for the Kyrgyz language. In the Universal Dependencies repository (https://universaldependencies.org/), I have seen that Kyrgyz is listed as one of the upcoming languages.

As far as I know, UDPipe (http://ufal.mff.cuni.cz/udpipe) can also be trained on your own corpora. There is also an R implementation of UDPipe, which you can likewise use to train your own models (see https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-annotation.html).
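
To give an idea of what such a training corpus looks like: UDPipe is trained on files in the CoNLL-U format used by Universal Dependencies (ten tab-separated columns per token, a blank line after each sentence). The following is only a minimal Python sketch of how hand-annotated (form, lemma, POS) triples could be written in that format. The function name to_conllu, the sentence ID "demo-1", and the Kyrgyz example sentence with its annotation are illustrative assumptions and are not taken from any existing corpus. The dependency columns are left as "_", so, as far as I understand, such a file would only be suitable for training the tagger and lemmatizer, not the parser.

```python
# The ten columns of the CoNLL-U format used by Universal Dependencies.
CONLLU_FIELDS = ("ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS",
                 "HEAD", "DEPREL", "DEPS", "MISC")


def to_conllu(sentence, sent_id):
    """Serialise one sentence of (form, lemma, upos) triples as CoNLL-U.

    Columns without annotation are filled with "_", the CoNLL-U placeholder.
    (Hypothetical helper for illustration only.)
    """
    lines = [f"# sent_id = {sent_id}"]
    for i, (form, lemma, upos) in enumerate(sentence, start=1):
        # ID, FORM, LEMMA, UPOS are filled in; the remaining columns are empty.
        row = [str(i), form, lemma, upos] + ["_"] * (len(CONLLU_FIELDS) - 4)
        lines.append("\t".join(row))
    # A blank line terminates each sentence in CoNLL-U.
    return "\n".join(lines) + "\n\n"


# Illustrative annotation (an assumption, not from a real corpus).
example = [
    ("Мен", "мен", "PRON"),
    ("китеп", "китеп", "NOUN"),
    ("окуйм", "оку", "VERB"),
    (".", ".", "PUNCT"),
]

block = to_conllu(example, "demo-1")
print(block, end="")
```

Writing one such block per sentence to a UTF-8 text file yields a file in the same format as the Universal Dependencies treebanks.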

There are probably many more trainable NLP tools out there that might meet your requirements; the ones mentioned above are those that I know and have found easy to use.

Best wishes, Michael

On 31.01.2019 at 11:25, corpora-request at uib.no wrote:
> ------------------------------
>
> Message: 2
> Date: Wed, 30 Jan 2019 12:34:36 +0100
> From: Jörg Knappen <j.knappen at mx.uni-saarland.de>
> Subject: [Corpora-List] Searching an easy-to-train Lemmatizer and POS
> tagger for Kyrgyz
> To: corpora at uib.no
>
>
> I am searching for some tools usable for Lemmatising and POS-tagging
> Kyrgyz. Kyrgyz is a Turkic language (agglutinative) written with the
> Cyrillic alphabet. I don't expect pre-trained tools to be out there
> (if there is one, it would be great), but I hope to find something
> that can be trained easily (not needing too much training data).
>
> Thanks in advance,
>
> Jörg Knappen
>
>


