[Corpora-List] Searching an easy-to-train Lemmatizer and POS tagger for Kyrgyz (Jörg Knappen)

Koos Wilt kooswilt at gmail.com
Thu Jan 31 15:15:37 CET 2019


Try the Python NLTK tagger and lemmatizer. They integrate easily with the rest of NLTK, which is a great package.
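
To make the suggestion concrete, here is a minimal sketch of training an NLTK tagger on hand-annotated sentences. The three Kyrgyz sentences and their UD-style tags are toy data made up for illustration, not a real corpus, and the suffix length and the NOUN fallback are assumptions that would need tuning. It covers tagging only: the lemmatizer that ships with NLTK is built on the English WordNet, so lemmatisation for Kyrgyz would need a separate approach.

```python
from nltk.tag import AffixTagger, DefaultTagger, UnigramTagger

# Toy training data (illustrative only): each sentence is a list of
# (token, tag) pairs, here using Universal Dependencies style tags.
train_sents = [
    [("мен", "PRON"), ("китеп", "NOUN"), ("окудум", "VERB")],
    [("бала", "NOUN"), ("келди", "VERB")],
    [("жакшы", "ADJ"), ("китеп", "NOUN")],
]

# Backoff chain: exact word lookup -> two-character suffix -> default tag.
# The suffix tagger is one simple way to get some generalisation in an
# agglutinative language; NOUN as the last resort is an assumption.
default = DefaultTagger("NOUN")
affix = AffixTagger(train_sents, affix_length=-2, min_stem_length=2,
                    backoff=default)
tagger = UnigramTagger(train_sents, backoff=affix)

# Tag a new sequence built from words seen in training.
tagged = tagger.tag(["бала", "китеп", "окудум"])
print(tagged)
```

With real annotated data, the same chain can be extended with nltk.tag.BigramTagger on top, and tagger.accuracy(held_out_sents) (evaluate() in older NLTK releases) gives a quick quality check.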

On Thu, 31 Jan 2019 at 15:09, Michael Ustaszewski <Michael.Ustaszewski at uibk.ac.at> wrote:


> Re: Searching an easy-to-train Lemmatizer and POS tagger for Kyrgyz
>
> Dear Jörg,
>
> regarding your question about trainable POS taggers and lemmatizers for
> the Kyrgyz language: OpenNLP provides training interfaces for each of
> its modules (see
> https://opennlp.apache.org/docs/1.9.1/manual/opennlp.html).
> Alternatively, you may consider the IXA pipes
> (http://ixa2.si.ehu.es/ixa-pipes/), which are based on OpenNLP and which
> provide exactly what you are looking for: easily trainable,
> language-independent tools, hence you can train your own models for
> tokenisation, lemmatisation, POS-tagging, NERC, and so on. Of course,
> you need suitable training corpora. Several input formats are supported
> by the IXA pipes training module. However, I am not aware of any
> training corpora for the Kyrgyz language. In the Universal Dependencies
> repository (https://universaldependencies.org/) I have seen that Kyrgyz
> is one of the upcoming languages.
>
> As far as I know, UDPipe (http://ufal.mff.cuni.cz/udpipe) can be
> trained on your own corpora. There is also an implementation of UDPipe
> in R, which you can likewise use to train your own models (see
> https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-annotation.html).
>
> There are probably many more trainable NLP tools out there that might
> meet your requirements - the ones mentioned above are those that I know
> and have found easy to use.
>
> Best wishes,
> Michael
>
> On 31.01.2019 at 11:25, corpora-request at uib.no wrote:
> > ------------------------------
> >
> > Message: 2
> > Date: Wed, 30 Jan 2019 12:34:36 +0100
> > From: Jörg Knappen <j.knappen at mx.uni-saarland.de>
> > Subject: [Corpora-List] Searching an easy-to-train Lemmatizer and POS
> > tagger for Kyrgyz
> > To: corpora at uib.no
> >
> >
> > I am searching for tools usable for lemmatising and POS-tagging
> > Kyrgyz. Kyrgyz is a Turkic language (agglutinative) written with the
> > Cyrillic alphabet. I don't expect pre-trained tools to be out there
> > (if there is one, that would be great), but I hope to find something
> > that can be trained easily (without needing too much training data).
> >
> > Thanks in advance,
> >
> > Jörg Knappen