[Corpora-List] Cyrillic tokenizer and sentence splitter

Andy Roberts andyr at comp.leeds.ac.uk
Fri May 13 10:39:04 CEST 2005


George,

I know you asked for an implementation in Perl, but I know of none.
However, I know of Java implementations that *should* work, although
I've not actually tested with Cyrillic languages.

I released jTokeniser 1.0 a couple of months back. It contains four
types of tokeniser, from a basic whitespace tokeniser to the more advanced
BreakIteratorTokeniser. Now, BreakIterator doesn't really mean a lot to
many people, but it basically utilises some special classes that
contain lots of built-in knowledge about lots of languages. The
BreakIterator has algorithms to split a string into tokens based on
the language, or more technically, the Locale, that you specify.

http://www.comp.leeds.ac.uk/andyr/software/jTokeniser/
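
As a rough illustration of locale-aware tokenising, here is a minimal
sketch that calls the standard java.text.BreakIterator directly. It does
not use jTokeniser's own classes, whose API is not shown in this message.
The class name WordTokenSketch, the method name tokenise and the sample
text are assumptions made up for the example, and the rule for dropping
spaces and punctuation is just one simple choice.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical example class; not part of jTokeniser.
class WordTokenSketch {

    // Split text into word tokens using the word-boundary rules for the given locale.
    static List<String> tokenise(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String candidate = text.substring(start, end);
            // Boundaries also delimit spaces and punctuation, so keep only
            // segments that begin with a letter or digit.
            if (Character.isLetterOrDigit(candidate.codePointAt(0))) {
                tokens.add(candidate);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "Privet, mir!" in Cyrillic, written with Unicode escapes so the source stays ASCII.
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440!";
        System.out.println(tokenise(text, Locale.forLanguageTag("ru")));
    }
}
```

Whether the output is good enough for real Russian or Bulgarian text
would need checking against your own data.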

It's worth noting that the BreakIterator can be set to split sentences
too! My tokeniser doesn't provide this option (yet!). My tokeniser just
makes BreakIterator much easier to use. For more on this, see:

http://java.sun.com/docs/books/tutorial/i18n/text/boundaryintro.html
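
Sentence splitting with the same class might look something like the
sketch below. Again this uses only the standard java.text.BreakIterator.
The class name SentenceSplitSketch, the method name split and the sample
text are assumptions for illustration, and trimming the trailing
whitespace of each sentence is a choice made here, not something the
BreakIterator does itself.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical example class; not part of jTokeniser.
class SentenceSplitSketch {

    // Split text into sentences using the sentence-boundary rules for the given locale.
    static List<String> split(String text, Locale locale) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(locale);
        sentences.setText(text);
        List<String> result = new ArrayList<>();
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE; start = end, end = sentences.next()) {
            // Each segment includes the whitespace that follows it, so trim it.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "Privet, mir! Kak dela?" in Cyrillic, written with Unicode escapes.
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440! "
                + "\u041a\u0430\u043a \u0434\u0435\u043b\u0430?";
        for (String sentence : split(text, Locale.forLanguageTag("ru"))) {
            System.out.println(sentence);
        }
    }
}
```

Abbreviations followed by a full stop are the usual trouble spot for any
sentence splitter, so that is worth testing on real text first.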

I've always found Java more accommodating towards multilingual language
processing, as it was very much designed with internationalisation
(i18n) and localisation (l10n) issues in mind from the outset. It may be
the case that it would be quicker to learn some basic Java than to seek
similar functionality in Perl (or implement your own).

Andy Roberts


On Thu, 12 May 2005, George Mitrevski wrote:


> Hi folks.
>
> Can anyone recommend a good Perl sentence splitter and tokenizer that
> works well with Cyrillic characters/texts (Russian, Bulgarian, etc.)?
> I've tried some for English, German and other languages, but they don't
> do well with Cyrillic.
>
> thanks,
>
> George.
>
> Foreign Languages tel. 334-844-6376
> 6030 Haley Center fax. 334-844-6378
> Auburn University
> Auburn, AL 36849
> home: www.auburn.edu/~mitrege
>





