[Corpora-List] SUMMARY: sentence boundary detectors

Armin Schmidt armin.sch at gmail.com
Fri Mar 2 22:42:10 CET 2007

Dear all,

Thank you for all the helpful responses. I was preparing several
parallel corpora for a machine translation task involving German,
Russian, English, and Spanish. To get good results from sentence
alignment, I was looking for a sentence splitter that would perform
equally well on all the data sets and, where it erred at all, make the
same or similar errors in every language. I also did not have any
lists of abbreviations.

I received a particularly nice response from Jan Strunk, who kindly
provided a preliminary implementation of his system 'Punkt'.
'Punkt' learns abbreviations and sentence boundaries in a
language-independent, unsupervised manner.

Links to similar tools for one or several languages were of great help,
too. They are:

http://aot.ru/download/graphan.tar.gz (source in C++, dll is included in

German, Russian, English:
(fast, rule-based).

(rule-based, Java)

Tools of the SRI LM toolkit: http://www.speech.sri.com/projects/srilm/

Needs to be provided with a set of abbreviations for a particular language:

For Perl, several modules are available on http://www.cpan.org/ that
can be extended to languages other than the ones they ship with. E.g.
for Russian, use Lingua::EN::Sentence and add an acronym list:
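
The abbreviation-list approach that these splitters share can be
sketched as follows. This is a Python illustration of the general idea
only, not the code or API of any of the tools listed above; the function
name `split_sentences` and the sample abbreviations are assumptions.

```python
import re


def split_sentences(text, abbreviations):
    """Split after '.', '!' or '?' followed by whitespace, unless the
    period belongs to a token from the supplied abbreviation list."""
    # Normalise entries such as "Dr." or "p.m." to "dr" and "p.m".
    known = {a.lower().rstrip(".") for a in abbreviations}
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]+\s+", text):
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        # A period right after a known abbreviation is not a boundary.
        if text[match.start()] == "." and last_word in known:
            continue
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    # Whatever is left after the last boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "Dr. Smith arrived at 5 p.m. yesterday. He left early! Why?"
for sentence in split_sentences(text, ["Dr.", "p.m."]):
    print(sentence)
```

Supporting another language then amounts to passing in a different
abbreviation list. Note one limitation of this sketch: an abbreviation
that actually ends a sentence is never treated as a boundary.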

Thanks again & best regards,


