[Corpora-List] Sentence segmenting

Diana Maynard d.maynard at dcs.shef.ac.uk
Mon Aug 13 16:23:41 CEST 2012


Hi Jeff,

The sentence splitter in GATE is pretty accurate, especially for English. You can easily improve it for any language by adding your own abbreviation list or editing the existing one. The issues that usually foil it are related to line breaks in less formal kinds of documents, such as forum messages (but there are a couple of alternative versions of the splitter for just such an eventuality).

Diana
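
To make the abbreviation-list idea concrete, here is a minimal Python sketch of an abbreviation-aware splitter. It is not GATE's implementation or its API: the function name `split_sentences`, the `ABBREVIATIONS` set and the boundary regex are all assumptions made up for illustration. It treats `.`, `!` or `?` followed by whitespace as a candidate boundary, skips the boundary when the period ends a listed abbreviation, and joins hard-wrapped lines first so that a line break alone never ends a sentence.

```python
import re

# Hypothetical abbreviation list (an assumption for this sketch);
# extend or edit it for your own language and genre.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "prof", "st", "jr", "sr"}

# Candidate boundary: one or more of . ! ? followed by whitespace.
BOUNDARY = re.compile(r"[.!?]+\s+")


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text into sentences, ignoring periods after known abbreviations."""
    # Collapse line breaks and repeated spaces, so hard-wrapped text
    # (e.g. forum messages) is not split at the wrap points.
    text = " ".join(text.split())
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        # Find the word immediately before the punctuation.
        words = text[start:match.start()].split()
        token = words[-1].lower().lstrip("(\"'") if words else ""
        # A single period after a listed abbreviation is not a boundary.
        if match.group().strip() == "." and token in abbreviations:
            continue
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    # Whatever remains after the last boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = "Dr. Smith met Mrs. Jones. They talked\nfor an hour. Was it useful? Yes!"
    for sentence in split_sentences(sample):
        print(sentence)
```

A known limitation of this sketch is that an abbreviation which genuinely ends a sentence is never split, so the list needs tuning against real text.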

On 13/08/12 14:35, Jeff Elmore wrote:
> I'm curious what folks are using these days for sentence segmenting for
> English.
>
> My application involves narrative and informational texts at a variety
> of reading levels and genres. Most text is hand-edited to eliminate
> non-prose content but any system that could respond robustly to unedited
> text would be awesome, of course.
>
> Mostly we've been using hand-crafted tools written in Python. I have
> checked out what NLTK offers but from what I've seen there's not
> anything terribly accurate in it (fails on obvious common cases like
> some honorifics). We did develop a decision tree based model using Weka
> for Spanish text. I'd be happy to do this again for English but wanted
> to see if there's something good already out there.
>
> Thanks in advance!


