[Corpora-List] Sentence Splitter tool

Bill_Lang(Gmail) billlangjun at gmail.com
Mon Oct 29 12:06:47 CET 2007


Hi Naveed,

NLTK provides a class named PunktSentenceTokenizer for sentence splitting. Its documentation describes it as follows:

Class PunktSentenceTokenizer

A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries. This approach has been shown to work well for many European languages.

Here is some demo code in Python:

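---------------------------------------------------------------------------------

import nltk.data

# Load the pre-trained English Punkt model (requires the NLTK punkt data).
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Read the whole input file; the with-block closes it afterwards.
with open("test.txt") as fp:
    data = fp.read()

# Print the sentences, separated by a line of dashes.
print('\n-----\n'.join(tokenizer.tokenize(data)))

---------------------------------------------------------------------------------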
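
To see why the abbreviation model mentioned in the class description matters, here is a rough sketch in plain Python. It is not the Punkt algorithm: the abbreviation list is hand-written and hypothetical, whereas Punkt learns it from unlabeled text, and the function names are mine rather than NLTK's. It only shows how a splitter that breaks on every period goes wrong and how knowing the abbreviations repairs it.

```python
import re

# Hypothetical, hand-written abbreviation list. Punkt learns such a list
# from unlabeled text instead of hard-coding it.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc", "e.g", "i.e"}


def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def abbreviation_aware_split(text, abbreviations=ABBREVIATIONS):
    """Like naive_split, but rejoin pieces that end in a known abbreviation."""
    sentences = []
    buffer = ""
    for piece in naive_split(text):
        buffer = f"{buffer} {piece}" if buffer else piece
        last_token = buffer.split()[-1].rstrip(".").lower()
        if buffer.endswith(".") and last_token in abbreviations:
            continue  # probably not a real boundary; keep accumulating
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences


sample = "Dr. Smith arrived late. He apologized to Prof. Jones."
print(naive_split(sample))
print(abbreviation_aware_split(sample))
```

The naive version cuts the sample after "Dr." and "Prof.", while the abbreviation-aware version returns the two intended sentences. A fixed list like this misses anything it has not seen, which is the gap the unsupervised training in Punkt is meant to close.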

_____

From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Afzal, Naveed
Sent: Monday, October 29, 2007 5:48 PM
To: corpora at uib.no
Subject: [Corpora-List] Sentence Splitter tool

I am looking for a sentence splitter tool. Can anyone help me out with this?

Thanks,

Naveed
