[Corpora-List] Sentence segmenting

Xu Jiajin ustcxujj at gmail.com
Tue Aug 14 07:42:34 CEST 2012


Hi Jeff,

You might like to try our standalone English sentence segmenter, which can be downloaded at http://www.fleric.org.cn/pub/ss.rar

Jiajin

Jiajin Xu, PhD, Associate Professor
National Research Centre for Foreign Language Education
Beijing Foreign Studies University

On Tue, Aug 14, 2012 at 6:29 AM, Marcin Miłkowski <list-address at wp.pl> wrote:


> Hi Jeff,
>
> If you want to reuse translators' resources (and computer-aided
> translation tools need to have text segmented into sentences), you can use
> the SRX standard. I have authored some rules for English, though they are
> not perfect (I have a much better set of rules for Polish). The open-source
> library that supports SRX, segment, is also pretty fast.
>
> The paper is here:
>
> http://marcinmilkowski.pl/downloads/ltc-043-milkowski.pdf
>
> The rules are here:
>
> http://languagetool.svn.sourceforge.net/viewvc/languagetool/trunk/JLanguageTool/src/resource/segment.srx?revision=7751
>
> Regards,
> Marcin
>
> On 2012-08-13 22:20, Sebastian Nagel wrote:
>
>> Hi Jeff,
>>
>> two years ago there was an exhaustive summary of a similar request:
>> http://mailman.uib.no/public/corpora/2010-August/011367.html
>>
>> But check the list archives (or Google) for
>> "sentence (splitt(er|ing)|boundar(y|ies)|detector)" or similar.
>> There have been a couple of threads over the last few years.
>>
>> Regards,
>> Sebastian
>>
>> On 08/13/2012 03:35 PM, Jeff Elmore wrote:
>>
>>> I'm curious what folks are using these days for sentence segmenting for
>>> English.
>>>
>>> My application involves narrative and informational texts at a variety of
>>> reading levels and genres. Most text is hand-edited to eliminate non-prose
>>> content but any system that could respond robustly to unedited text would
>>> be awesome, of course.
>>>
>>> Mostly we've been using hand-crafted tools written in Python. I have
>>> checked out what NLTK offers but from what I've seen there's not anything
>>> terribly accurate in it (it fails on obvious common cases such as some
>>> honorifics). We did develop a decision-tree-based model using Weka for
>>> Spanish text. I'd be happy to do this again for English but wanted to see
>>> if there's something good already out there.
>>>
>>> Thanks in advance!
>>>
>>>
>>>
>>> _______________________________________________
>>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>> Corpora mailing list
>>> Corpora at uib.no
>>> http://mailman.uib.no/listinfo/corpora
>>>
>>>
>>
>>
>
>
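
The SRX approach Marcin mentions works with an ordered list of rules, each marked as break or no-break and each carrying a "before break" and an "after break" regular expression; at any position in the text, the first rule whose two patterns match decides the outcome. A minimal Python sketch of that idea follows. It is only an illustration under assumptions: the `Rule` tuple, the `make_rule` and `split_sentences` helpers, and the three sample rules are hypothetical and are not taken from segment.srx or from the segment library's API. The honorific list is deliberately tiny and would need extending for real text.

```python
import re
from typing import List, NamedTuple, Pattern


class Rule(NamedTuple):
    is_break: bool         # True: split here; False: explicitly do not split
    before: Pattern[str]   # must match text ending at the candidate position
    after: Pattern[str]    # must match text starting at the candidate position


def make_rule(is_break: bool, before: str, after: str) -> Rule:
    # Anchor the "before" pattern so it has to end exactly at the position.
    return Rule(is_break, re.compile(rf"(?:{before})\Z"), re.compile(after))


# Hypothetical sample rules, in priority order (exceptions come first).
RULES = [
    make_rule(False, r"\b(?:Mr|Mrs|Ms|Dr|Prof)\.", r"\s"),  # honorifics
    make_rule(True, r"[.?!]+", r"\s"),                      # end punctuation
]


def split_sentences(text: str, rules: List[Rule] = RULES) -> List[str]:
    sentences = []
    start = 0
    # Checking every position is quadratic; fine for a sketch, slow for
    # long documents.
    for pos in range(1, len(text)):
        for rule in rules:
            if rule.before.search(text, 0, pos) and rule.after.match(text, pos):
                if rule.is_break:
                    segment = text[start:pos].strip()
                    if segment:
                        sentences.append(segment)
                    start = pos
                break  # the first matching rule wins, break or no-break
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

Calling `split_sentences("Dr. Smith met Mrs. Jones on Monday. They talked for an hour!")` keeps "Dr." and "Mrs." inside the first sentence because the no-break rule is consulted before the generic break rule. Abbreviations not covered by a no-break rule (for example "a.m.") would still trigger a split.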