[Corpora-List] Sentence segmenting

Thanh-Le Ha leht82 at gmail.com
Tue Aug 14 10:35:58 CEST 2012


Hi Jeff,

I also tried NLTK's pre-trained sentence segmentation before and was not satisfied with the quality either. I turned to Splitta (http://code.google.com/p/splitta/), mentioned by Aleksandar above, and it is really good for English. It hasn't been trained on other languages, though, but for your requirements I think Splitta is worth trying.
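
For what it's worth, the in-domain Punkt retraining that Steven suggests below only takes a few lines of NLTK. A minimal sketch (the file name and the example sentence here are just placeholders):

    from nltk.tokenize.punkt import PunktSentenceTokenizer

    # Read a plain-text sample from the target domain
    # ("domain_sample.txt" is only a placeholder file name).
    with open("domain_sample.txt", encoding="utf-8") as f:
        domain_text = f.read()

    # Per the NLTK docs quoted below, passing raw text to the constructor
    # learns the abbreviation list and other parameters unsupervised.
    tokenizer = PunktSentenceTokenizer(domain_text)

    print(tokenizer.tokenize("Dr. Smith arrived at 5 p.m. He left soon after."))

How well this handles honorifics and other abbreviations will depend on how representative the training sample is of your target text.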

--Le.

On Tue, Aug 14, 2012 at 10:17 AM, Steven Bird <sb at csse.unimelb.edu.au> wrote:


> On 13 August 2012 23:35, Jeff Elmore <jelmore at lexile.com> wrote:
> > I have checked
> > out what NLTK offers but from what I've seen there's not anything
> terribly
> > accurate in it (fails on obvious common cases like some honorifics).
>
> Note that NLTK just uses Punkt, and this won't necessarily perform
> well if it uses an off-the-shelf model that was trained on data that
> contained different abbreviations to the test data:
>
> "Punkt is designed to learn parameters (a list of abbreviations, etc.)
> unsupervised from a corpus similar to the target domain. The
> pre-packaged models may therefore be unsuitable: use
> PunktSentenceTokenizer(text) to learn parameters from the given text."
> http://nltk.org/api/nltk.tokenize.html
>
> -Steven Bird
>