[Corpora-List] tokenizer & sentence boundary detection
joregan at gmail.com
Mon Jun 14 15:35:34 CEST 2010
On 14 June 2010 13:54, Joerg Tiedemann <jorg.tiedemann at lingfil.uu.se> wrote:
> I'm looking for freely available tokenizers and sentence splitters for
> various languages. I am interested in language-specific and
> language-independent/generic tools. I am also interested in domain-specific
> tokenizers - anything (off-the-shelf) that can easily be used on large scale
There's the Java-based program Segment
(https://sourceforge.net/projects/segment/, MIT-style licence), which
uses SRX rules for sentence splitting. It also provides a
sentence-splitting library, which is used by LanguageTool and the Maligna aligner.
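To illustrate what SRX rules actually do: each rule pairs a "before-break" and an "after-break" regular expression with a break="yes"/"no" flag, and at each candidate position the first matching rule wins (no-break exception rules are listed before the general break rules). The sketch below is a minimal, hypothetical illustration of that cascade in Python; it is not the Segment library's API, and the rule set is invented for the example.

```python
import re

# Hypothetical SRX-style rules, in priority order:
# (is_break, before-break pattern, after-break pattern).
RULES = [
    # no-break: a few common abbreviations followed by whitespace
    (False, re.compile(r"\b(?:Mr|Dr|St|etc)\.$"), re.compile(r"^\s")),
    # break: sentence-final punctuation, then whitespace and a capital
    (True,  re.compile(r"[.!?]$"), re.compile(r"^\s+[A-Z]")),
]

def split_sentences(text):
    """Split text at positions where the first matching rule says break."""
    breaks = []
    for i in range(1, len(text)):
        before, after = text[:i], text[i:]
        for is_break, before_re, after_re in RULES:
            if before_re.search(before) and after_re.match(after):
                if is_break:
                    breaks.append(i)
                break  # first matching rule decides this position
    sentences, start = [], 0
    for b in breaks:
        sentences.append(text[start:b].strip())
        start = b
    sentences.append(text[start:].strip())
    return sentences

# The no-break rule fires first on "Dr.", so no split happens there:
# split_sentences("Dr. Smith arrived. He sat down.")
#   -> ["Dr. Smith arrived.", "He sat down."]
```

Real SRX files express the same idea in XML, with per-language rule sets mapped to language codes, which is what makes tools like Segment usable across languages.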
More information about the Corpora mailing list