[Corpora-List] Sentence segmenting

Marcin Miłkowski list-address at wp.pl
Tue Aug 14 00:29:23 CEST 2012


Hi Jeff,

If you want to reuse translators' resources (and computer-aided translation tools need to have text segmented into sentences), you can use the SRX standard. I have authored some rules for English, though they are not perfect (I have a much better set of rules for Polish). The open-source library that supports SRX, segment, is also pretty fast.

The paper is here:

http://marcinmilkowski.pl/downloads/ltc-043-milkowski.pdf

The rules are here:

http://languagetool.svn.sourceforge.net/viewvc/languagetool/trunk/JLanguageTool/src/resource/segment.srx?revision=7751
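
For readers who have not seen SRX before, here is a rough sketch of how its rules work. An SRX file is XML in which each rule carries a break="yes" or break="no" attribute plus a "beforebreak" and an "afterbreak" regular expression; at every candidate position the first rule whose two patterns match decides whether to split. The Python below only imitates that logic. The rule list and the segment() function are hypothetical illustrations, not the rules from the file linked above and not the API of the segment library.

```python
import re

# Hypothetical rules in SRX order: (is_break, before-break regex, after-break regex).
# Exception rules (is_break=False) come first so that they win over the general rule.
RULES = [
    (False, r"\b(?:Mr|Mrs|Ms|Dr|Prof)\.\s", r""),   # no break after an honorific
    (True,  r"[.?!]+\s",                    r"[A-Z\"']"),  # break before a capital or quote
]


def segment(text, rules=RULES):
    """Split text into sentences using SRX-like before/after break rules."""
    compiled = [
        (is_break, re.compile(r"(?:" + before + r")\Z"), re.compile(after))
        for is_break, before, after in rules
    ]
    sentences, start = [], 0
    for pos in range(1, len(text)):
        head = text[:pos]  # the before pattern must match right up to pos
        for is_break, before, after in compiled:
            if before.search(head) and after.match(text, pos):
                if is_break:
                    sentences.append(text[start:pos].strip())
                    start = pos
                break  # the first matching rule decides this position
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = "Dr. Smith arrived. He met Mrs. Jones at 5 p.m. sharp! Was it late?"
    for sentence in segment(sample):
        print(sentence)
```

The sketch rescans the text prefix at every position, so it is quadratic and only suitable for illustration; a real SRX engine is considerably more careful about matching cost.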

Regards, Marcin

On 2012-08-13 22:20, Sebastian Nagel wrote:
> Hi Jeff,
>
> two years ago there was an exhaustive summary of a similar request:
> http://mailman.uib.no/public/corpora/2010-August/011367.html
>
> But check the list archives (or Google) for
> "sentence (splitt(er|ing)|boundar(y|ies)|detector)" or similar.
> There have been a couple of threads over the last few years.
>
> Regards,
> Sebastian
>
> On 08/13/2012 03:35 PM, Jeff Elmore wrote:
>> I'm curious what folks are using these days for sentence segmenting for
>> English.
>>
>> My application involves narrative and informational texts at a variety of
>> reading levels and genres. Most text is hand-edited to eliminate non-prose
>> content but any system that could respond robustly to unedited text would
>> be awesome, of course.
>>
>> Mostly we've been using hand-crafted tools written in Python. I have
>> checked out what NLTK offers but from what I've seen there's not anything
>> terribly accurate in it (fails on obvious common cases like some
>> honorifics). We did develop a decision tree based model using Weka for
>> Spanish text. I'd be happy to do this again for English but wanted to see
>> if there's something good already out there.
>>
>> Thanks in advance!
>>
>>
>>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>


