[Corpora-List] Tokenizer for English Web Corpus (and Email Data)

Andrew.Lampert at csiro.au
Wed Mar 14 05:28:00 CET 2007

Further to Adriano's request below, is anyone aware of sentence
tokenizers/splitters that have been trained on or applied to email data?

Some of the noise in email text is similar to that found in web text
(emoticons, typos, etc.), but there are also email-specific phenomena
(greetings, signature blocks, quoted material, etc.) that seem to
require techniques tailored to email.
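As a rough illustration of the kind of tailoring involved, one common first step is to strip quoted material and the trailing signature block before running a sentence splitter. The sketch below is only a heuristic, assuming the conventional ">" quote prefix and the "-- " signature delimiter; real email is messier than this.

```python
import re

def strip_email_noise(body: str) -> str:
    """Heuristically remove quoted material and a trailing signature
    block from an email body before sentence splitting."""
    kept = []
    for line in body.splitlines():
        # Drop quoted material (lines starting with one or more '>').
        if re.match(r"\s*>", line):
            continue
        # The conventional "-- " line marks the signature; stop there.
        if line == "-- ":
            break
        kept.append(line)
    return "\n".join(kept)
```

Greetings, inline replies, and non-standard signature markers would need further (often corpus-specific) rules on top of this.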

I await your summary of responses with interest, Adriano.

Are there any additional pointers that people can offer, specifically
with regard to processing email text?

Andrew Lampert
Research Engineer
Information Engineering Laboratory

Post: Locked Bag 17, North Ryde, NSW 1670, Australia
Office: Building E6B, Macquarie University, North Ryde, 2113
Tel: +61 2 9325 3129, Fax: +61 2 9325 3200


From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
Behalf Of Adriano Ferraresi
Sent: Tuesday, 13 March 2007 10:40 PM
Subject: [Corpora-List] Tokenizer for English Web Corpus

Hi everybody,

I am currently embarking on a research project that aims to build a
large corpus of English through automated web crawling. For this
purpose, I would welcome suggestions for an efficient tokenizer for
English. Ideally, it would take into account specific features of Web
writing (such as the treatment of emoticons, typos, commonly used
abbreviations, etc.). Does anyone know of such a tool?
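To make the requirement concrete, here is a minimal sketch of the kind of behaviour such a tokenizer would need: patterns for emoticons and fixed abbreviations are tried before generic word and punctuation patterns, so they survive as single tokens. The emoticon and abbreviation lists here are tiny illustrative samples, not a real inventory.

```python
import re

# Emoticons and a few abbreviations are matched before the generic
# word/punctuation alternatives, so ":-)" and "e.g." stay whole.
TOKEN_RE = re.compile(r"""
    (?:[:;=8][-o*']?[)\](\[dDpP/:}{@|\\])   # emoticons like :-) ;P =D
  | (?:e\.g\.|i\.e\.|etc\.)                 # a few fixed abbreviations
  | \w+(?:['-]\w+)*                         # words, incl. don't, e-mail
  | [^\w\s]                                 # any other punctuation mark
""", re.VERBOSE)

def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text)
```

A production tool would of course need a much larger emoticon/abbreviation inventory and some handling of URLs, typos, and markup remnants.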

I will post a summary of the answers I (hopefully!) receive.

Thank you.

Adriano Ferraresi
