[Corpora-List] Sentence Splitter tool

Eva Forsbom evafo at stp.lingfil.uu.se
Mon Oct 29 12:40:20 CET 2007


Hi Naveed,


> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf
> Of Afzal, Naveed Sent: 29 October 2007 09:48 To: corpora at uib.no Subject:
> [Corpora-List] Sentence Splitter tool
>
> I am looking for sentence splitter tool .... can any one help me out
> regarding this?
>
> Thanks,
> Naveed

This question (and questions on tokenisers) has been asked before on this list. I collected snippets of some of the answers for my own benefit, and enclose them below in the hope that they may be of some use.

Best regards, Eva

--
-------------------------------------
Eva Forsbom, Uppsala University/GSLT
E-mail: evafo at stp.lingfil.uu.se
URL: http://stp.lingfil.uu.se/~evafo/
Telephone: +46 (0)18 471 70 06
Fax: +46 (0)18 471 14 16
Address: Dept. of Linguistics and Philology
         Box 635
         SE-751 26 Uppsala
         SWEDEN

Snippets on sentence splitters and tokenisers collected from corpora list and elsewhere:

Dan Roth:

A pretty good sentence splitter can be downloaded also from http://L2R.cs.uiuc.edu/~cogcomp/cc-software.html

Miles Osborne:

if you mean code for segmenting text into sentences, then here are a few links:

Adwait Ratnaparkhi's MXTERMINATOR:

http://www.cis.upenn.edu/~adwait/statnlp.html

the LTG TTT system might be useful:

http://www.ltg.ed.ac.uk/software/ttt/index.html

Tony Rose:

There's a simple perl5 sentence splitter available at: http://search.cpan.org/author/TGROSE/HTML-Summary-0.017/

Don't know about good, but it's certainly free :)

Joerg Schuster:

I have also asked for sentencizers very recently. Here is a summary:

Name/Nickname | Author             | Web Site | Comment
--------------+--------------------+----------+--------
ave           | Ave Wrigley        | http://search.cpan.org/author/TGROSE/HTML-Summary-0.017/ | perl module
mxterminator  | Adwait Ratnaparkhi | http://www.cis.upenn.edu/~adwait/statnlp.html | java, probabilistic
satz          | David D. Palmer    | http://elib.cs.berkeley.edu/src/satz/ | written in c, has to be trained
sentence.cgi  | ?                  | http://misshoover.si.umich.edu/~zzheng/sentence/ | cgi script
shlomo        | Shlomo Yona        | http://search.cpan.org/author/SHLOMOY/Lingua-EN-Sentence-0.25/lib/Lingua/EN/Sentence.pm | perl module
ttt           | ?                  | http://www.ltg.ed.ac.uk/software/ttt/index.html | Seems to be available only for SPARC machines

You can test the programs ave, mxterminator and shlomo here: http://www.cis.uni-muenchen.de/~js/sentencize

If you do non-trivial tests, please let me know the results.

Staffan Hermansson:

Hello people. Here's a brief summary of the things I've received. Some people were nice enough to attach documents. I've located most of those on the web for you. Again, thank you for your support.

Applications:

A free CPAN Perl module for sentence splitting.
http://listserv.linguistlist.org/cgi-bin/wa?A2=ind0302&L=corpora&P=R5743

Shlomo Yona maintains another perl-based sentence splitter.
http://cs.haifa.ac.il/~shlomo/

Earlier posts on this list (might have missed some):
http://helmer.aksis.uib.no/corpora/1998-4/0026.html
http://helmer.aksis.uib.no/corpora/1999-3/0347.html
http://helmer.aksis.uib.no/corpora/2000-2/0225.html
http://helmer.aksis.uib.no/corpora/2003-1/0140.html

Reports:

Ghassan Mourad was kind enough to send me the following. Though I can't read a word of French (thanks anyway), it might still be of interest.

* Ghassan Mourad (1999) La segmentation de textes par l'étude de la ponctuation http://www.lalic.paris4.sorbonne.fr/articles/1998-1999/Mourad/CIDE99.pdf

* Ghassan Mourad La segmentation de textes par exploration contextuelle automatique, présentation du module SegATex

* Greg Grefenstette and Pasi Tapanainen. "What is a word, what is a sentence? Problems of tokenization." http://citeseer.nj.nec.com/grefenstette94what.html

* Tibor Kiss and Jan Strunk Scaled log likelihood ratios for the detection of abbreviations in text corpora http://www.linguistics.rub.de/~kiss/publications/abbrev.pdf

* Tibor Kiss and Jan Strunk Multilingual Least-Effort Sentence Boundary Disambiguation http://www.linguistics.rub.de/~kiss/publications/publications.html#boundaries

* Andrei Mikheev. "Text Segmentation." In R. Mitkov (ed.) Oxford Handbook of Computational Linguistics, OUP, 2003.

* Andrei Mikheev Tagging Sentence Boundaries (2000) http://citeseer.nj.nec.com/mikheev00tagging.html

* Andrei Mikheev Periods, Capitalized Words, etc (1999) http://citeseer.nj.nec.com/mikheev99periods.html

* David D. Palmer (2000) Tokenisation and Sentence Segmentation. In Robert Dale, Hermann Moisl and Harold Somers (Eds), A Handbook of Natural Language Processing, Marcel Dekker

* David D. Palmer and Marti A. Hearst, Adaptive Multilingual Sentence Boundary Disambiguation http://citeseer.nj.nec.com/palmer97adaptive.html

* J. Reynar and A. Ratnaparkhi, A Maximum Entropy Approach to Identifying Sentence Boundaries http://citeseer.nj.nec.com/article/reynar97maximum.html

http://www.cs.rochester.edu/u/tetreaul/academic.html

1. Sentence Splitters

* Satz Adaptive sentence boundary detector (C) (David Palmer and Marti Hearst)

* Dan Roth's splitter

* shlomoy Perl5 splitter

* tgrose: sentence perl module

* MXTERMINATOR (Adwait Ratnaparkhi)

* LTG TTT system

* Zhiping Zheng's cgi splitter

* Guenther cgi script

* Interactive Sentence Aligner (Joerg Tiedemann)

* Russian Sentence C++ Splitter (download) dll is here

* English rule-based Java sentence splitter (Scott Piao)

(links)

check the corpora-list archives:

http://listserv.linguistlist.org/cgi-bin/wa?S1=corpora

Patrick Tschorn:

I am pleased to announce the immediate availability of Sentrick, a sentence boundary detection program for German.

http://www.denkselbst.de/sbdniffler/sentrick.html

Sentrick requires Java 1.5, processes plain text, handles a variety of punctuation characters (including quotes) and is licensed under the GNU GPL.

Scott Songlin Piao:

I put my English sentence splitter on the website:

http://text0.mib.man.ac.uk:8080/sentencebreaker/heuristic_tool

It is a rule-based Java program and is downloadable.

I put my sentence breaker at the site:

http://text0.mib.man.ac.uk:8080/scottpiao/sent_detector

It has performed with very high precision, including in a commercial context. It is for English; I am not sure if it works on Spanish. You can try it on the demo website.

Jason Baldridge:

One fairly easy to use sentence boundary detector and tokenizer is included in the OpenNLP toolkit:

http://opennlp.sf.net

It is written in Java and is basically the same as Ratnaparkhi's detector. Lots of other tools, including parsing, tagging, and coreference are in that package. There are already trained models available for English. The tools themselves are not language specific, so if you provide an appropriate training corpus in Spanish, you can train new models easily enough. (And the code is open source, so you can modify it to make it more sensitive to another language (e.g., morphology) if you want.)

For other tools, many of which are geared for Spanish NLP, you might also have a look at FreeLing:

http://garraf.epsevg.upc.es/freeling/

There are certainly many other tools available, and it is actually pretty straightforward to whip up a detector from scratch. There are some recent unsupervised approaches for sentence boundary detection too that could be relevant for you. You might have a look at this article by Tibor Kiss and Jan Strunk:

http://www.linguistics.ruhr-uni-bochum.de/~strunk/ks2005FINAL.pdf
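To illustrate the remark that a detector can be whipped up from scratch, here is a minimal rule-based sketch in Python. It is not the code of any tool listed here: the function name `split_sentences`, the tiny `ABBREVIATIONS` set and the boundary pattern are all assumptions made for the example, and a usable splitter would need a far larger abbreviation list.

```python
import re

# Hypothetical, deliberately tiny abbreviation list; a real one would be far longer.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "etc", "vs", "e.g", "i.e"}

# Candidate boundary: one or more of . ! ? plus any closing quotes or brackets,
# then whitespace, then (looking ahead) an optional opening quote or bracket
# followed by an uppercase letter or a digit.
CANDIDATE = re.compile(r"""([.!?]+["')\]]*)\s+(?=["'(\[]?[A-Z0-9])""")


def split_sentences(text):
    """Split text into sentences using a regex plus an abbreviation check."""
    sentences = []
    start = 0
    for match in CANDIDATE.finditer(text):
        end = match.end(1)
        # Last whitespace-delimited token before the candidate boundary.
        last_token = text[start:end].split()[-1]
        core = last_token.rstrip("\"')]").rstrip(".!?").lower()
        # A period after a known abbreviation is not treated as a boundary.
        if match.group(1).startswith(".") and core in ABBREVIATIONS:
            continue
        sentences.append(text[start:end].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = 'Dr. Smith arrived at 9 a.m. sharp. He asked, "Is it ready?" Nobody answered.'
    for sentence in split_sentences(sample):
        print(sentence)
```

The sketch only survives "a.m." here because the next word is lowercase; an abbreviation followed by a capitalised word that is not in the list would still be split, which is the kind of case the trained and unsupervised systems mentioned in this thread try to handle.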

Steven Bird:

On 7/21/07, Jason Baldridge wrote:

> There are some recent unsupervised approaches for sentence boundary
> detection too that could be relevant for you. You might have a look at
> this article by Tibor Kiss and Jan Strunk:
>
> http://www.linguistics.ruhr-uni-bochum.de/~strunk/ks2005FINAL.pdf

Note that the Punkt system has been ported to Python and is included with the Natural Language Toolkit (http://nltk.org/index.php), in module nltk_contrib.punkt
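The idea of learning abbreviations from unannotated text can be illustrated with a much cruder heuristic than the one Punkt uses. The sketch below is not the Kiss and Strunk log-likelihood method and does not call NLTK; it simply flags short word types that occur several times with a trailing period and never without one. The function name `guess_abbreviations` and the thresholds `min_count` and `max_length` are assumptions chosen for the example.

```python
from collections import Counter


def guess_abbreviations(text, min_count=2, max_length=4):
    """Guess abbreviations as short types that only ever occur with a final period."""
    with_period = Counter()
    without_period = Counter()
    for token in text.split():
        # Drop surrounding quotes, brackets and minor punctuation, but keep periods.
        token = token.strip("\"'()[],;:")
        if not token:
            continue
        if token.endswith("."):
            with_period[token[:-1].lower()] += 1
        else:
            without_period[token.lower()] += 1
    return {
        word
        for word, count in with_period.items()
        if word
        and count >= min_count
        and without_period[word] == 0
        and len(word) <= max_length
    }


if __name__ == "__main__":
    corpus = (
        "Dr. Jones met Dr. Lee. Prof. Kim and Prof. Park joined. "
        "The dog ran. The dog sat."
    )
    print(sorted(guess_abbreviations(corpus)))
```

On a corpus this small the thresholds do most of the work; on real text, short words that happen to end sentences repeatedly would be misclassified, which is one reason the published approach relies on a statistical test instead of raw counts.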

Andy Roberts:

It's not been under any major evaluation by myself, but my jTokeniser Java library has a sentence segmentation module. I'm utilising Java's built-in text processing libraries (which were donated by IBM's ICU4J project) to do all the hard work.

See http://www.andy-roberts.net/software/jTokeniser/

There's also a GUI available for you to test the various tokenisers interactively.

Katrin Tomanek:

We have an ML-based sentence splitter and tokenizer. Both are a little bit optimized for the biomedical domain (English), but are of course (given you have the training material) applicable to other domains.

Both tools are available in a command-line mode and as UIMA components. They can be downloaded from our website: http://julielab.de. You will find a reference to our paper on these tools (MEDINFO 2007) on the website as well.

Kevin B. Cohen:

We had good luck with Andy's jTokeniser in a corpus refactoring project recently. The inputs were biomedical texts, which present some unique weirdness, and it performed well. I don't have quantitative data. We *do* have some quantitative data on the performance of the LingPipe sentence splitter, and it performs very nicely in head-to-head comparisons with other systems.
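For anyone wanting to run a head-to-head comparison of their own, one common way to score a splitter is precision and recall over boundary positions. The sketch below assumes that gold and predicted boundaries are given as character offsets; the function name `boundary_scores` and the sample offsets are made up for illustration and are not results for any system mentioned here.

```python
def boundary_scores(gold, predicted):
    """Return (precision, recall, f1) for predicted sentence-boundary offsets."""
    gold = set(gold)
    predicted = set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up character offsets, purely for illustration.
    gold_offsets = [10, 25, 40, 60]
    predicted_offsets = [10, 25, 50]
    p, r, f = boundary_scores(gold_offsets, predicted_offsets)
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Running each candidate splitter over the same hand-annotated sample and comparing these three numbers gives a simple, if coarse, basis for comparison.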

Mehmet Kayaalp:

Last year, we examined 13 open-source and freeware software packages that can perform NL tokenization (many of which perform sentence boundary detection and more) and summarized our experience in a technical report, which is accessible at http://lhncbc.nlm.nih.gov/lhc/docs/reports/2006/tr2006003.pdf.


