[Corpora-List] Sentence segmenting

Steven Bird sb at csse.unimelb.edu.au
Tue Aug 14 10:17:47 CEST 2012


On 13 August 2012 23:35, Jeff Elmore <jelmore at lexile.com> wrote:
> I have checked
> out what NLTK offers but from what I've seen there's not anything terribly
> accurate in it (fails on obvious common cases like some honorifics).

Note that NLTK's sentence tokenizer is just Punkt, and Punkt won't necessarily perform well with an off-the-shelf model that was trained on data whose abbreviations differ from those in the test data:

"Punkt is designed to learn parameters (a list of abbreviations, etc.) unsupervised from a corpus similar to the target domain. The pre-packaged models may therefore be unsuitable: use PunktSentenceTokenizer(text) to learn parameters from the given text." http://nltk.org/api/nltk.tokenize.html
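
A minimal sketch of what that documentation describes, assuming NLTK is installed. The function name segment_with_domain_model and the sample text are made up for illustration; only PunktSentenceTokenizer and its tokenize() method come from NLTK. Punkt generally needs a reasonably large amount of in-domain text before it learns abbreviations reliably, so the tiny sample here only shows the mechanics, not the accuracy you would get in practice.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer


def segment_with_domain_model(training_text, target_text):
    """Learn Punkt parameters from training_text, then split target_text.

    Hypothetical helper for illustration; not part of NLTK.
    """
    # Passing raw text to the constructor trains Punkt unsupervised on it,
    # rather than relying on a pre-packaged model.
    tokenizer = PunktSentenceTokenizer(training_text)
    # tokenize() returns a list of sentence strings.
    return tokenizer.tokenize(target_text)


if __name__ == "__main__":
    # Placeholder text; in practice use a large sample from the target domain.
    sample = (
        "Dr. Smith reviewed the corpus. Dr. Jones annotated it. "
        "Dr. Smith and Dr. Jones then compared notes. "
        "The results were sent to Dr. Lee for review."
    )
    for sentence in segment_with_domain_model(sample, sample):
        print(sentence)
```

Training on the same text you intend to segment is allowed, since the learning is unsupervised; a separate, larger in-domain corpus can be passed as training_text instead.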

-Steven Bird


