[Corpora-List] RFTokenizer - morphological segmentation for Hebrew and other MRLs

Amir Zeldes Amir.Zeldes at georgetown.edu
Mon Sep 17 23:03:28 CEST 2018


** Apologies for cross-postings **

We are pleased to announce the release of RFTokenizer (V0.9), an automatic segmenter for complex words in morphologically rich languages. RFTokenizer can be downloaded under the Apache license here:

https://github.com/amir-zeldes/RFTokenizer

Pre-trained models are provided for Hebrew and Coptic, with state-of-the-art segmentation results for those two languages on the official test sets. The system can also be trained on further languages and datasets.

For more about the system, please refer to this paper:

Zeldes, Amir (2018) A Characterwise Windowed Approach to Hebrew Morphological Segmentation. In: Proceedings of the 15th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology. Brussels, Belgium.

https://arxiv.org/abs/1808.07214
