[Corpora-List] tokenizer & sentence boundary detection

Jimmy O'Regan joregan at gmail.com
Mon Jun 14 15:35:34 CEST 2010


On 14 June 2010 13:54, Joerg Tiedemann <jorg.tiedemann at lingfil.uu.se> wrote:
>
> I'm looking for freely available tokenizers and sentence splitters for
> various languages. I am interested in language-specific and
> language-independent/generic tools. I am also interested in domain-specific
> tokenizers - anything (off-the-shelf) that can easily be used on large scale
> corpora.

There's Segment, a Java-based program (https://sourceforge.net/projects/segment/, MIT-type licence), which uses SRX (Segmentation Rules eXchange) rules for sentence splitting. It includes a splitting library, which is used by LanguageTool and the Maligna sentence aligner.
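
To give a rough idea of how SRX-style rules work, here is a minimal Python sketch. It is not Segment's API or its rule set: the RULES list, the split_sentences function and the abbreviation list are all made up for illustration. The assumed model is that each rule pairs a "before break" regex and an "after break" regex with a break/no-break flag, and that the first rule matching at a given position decides.

```python
import re

# Hypothetical rules in the spirit of SRX (not taken from Segment):
# (is_break, before_pattern, after_pattern). Order matters: at each
# position the first rule whose two patterns both match decides.
RULES = [
    # Exception: no break after a few common abbreviations.
    (False, r"\b(?:Mr|Mrs|Dr|Prof|etc)\.", r"\s"),
    # Default: break after sentence-final punctuation followed by space.
    (True, r"[.?!]+", r"\s"),
]


def split_sentences(text, rules=RULES):
    """Split text at positions where the first matching rule is a break rule."""
    compiled = [
        (is_break, re.compile(r"(?:" + before + r")\Z"), re.compile(after))
        for is_break, before, after in rules
    ]
    sentences = []
    start = 0
    for pos in range(1, len(text)):
        for is_break, before, after in compiled:
            # 'before' must match text ending at pos, 'after' text starting at pos.
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    sentences.append(text[start:pos].strip())
                    start = pos
                break  # first matching rule wins, break or not
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Dr. Smith arrived. He sat down."))
```

On that input the sketch returns ['Dr. Smith arrived.', 'He sat down.'], since the abbreviation rule fires before the generic one at "Dr.". Slicing the text at every position makes it quadratic, so it only illustrates the rule ordering; it is not something to run on a large corpus.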

--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.


