[Corpora-List] Sentence segmenting

Adam Radziszewski kocikikut at gmail.com
Tue Aug 14 09:51:06 CEST 2012


On 14 August 2012 00:29, Marcin Miłkowski <list-address at wp.pl> wrote:


> Hi Jeff,
>
> if you want to reuse translators' resources (and computer-aided
> translation tools need to have text segmented into sentences), you can use
> the SRX standard. I have authored some rules for English, though they are not
> perfect (I have a much better set of rules for Polish). The open-source
> library that supports SRX, segment, is also pretty fast.
>

In case you're interested in using SRX rules, you may also consider trying our C++ implementation <http://nlp.pwr.wroc.pl/redmine/projects/toki/wiki/> (GNU LGPL). Its processing speed in tokens per second is similar to that of Marcin Miłkowski's Java segment tool, but if many short texts are to be processed, it may be convenient to avoid the Java VM start-up time.

Best,
Adam Radziszewski


