[Corpora-List] UGC Tokenizer

Gustavo Laboreiro gustavo.laboreiro at gmail.com
Wed Jan 4 16:25:16 CET 2012

We would like to announce that we have made available our tokenizer for User-Generated Content (UGC).

The tokenizer is based on a text-classification approach, making it more robust than simpler rule-based approaches. It is described in this article: http://dl.acm.org/citation.cfm?id=1871853

You can find it at the following URL: http://labs.sapo.pt/up/2011/11/12/sylvester-ugc-tokenizer/

It is written in Python, but we include a script that shows a simple way to call it from Perl. Other languages can use similar approaches.
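As a rough sketch of one such approach (not necessarily what the bundled Perl script does), a small Python filter can read one message per line and write one JSON-encoded result per line, so that any language able to spawn a process and read its output can use the tokenizer. The name run_bridge and the one-message-per-line convention are assumptions made for illustration; the tokenize argument stands for any callable, such as the tokenize method of a Tokenizer instance.

```python
import io
import json
import sys


def run_bridge(tokenize, in_stream=sys.stdin, out_stream=sys.stdout):
    """Read one message per line; write one JSON-encoded result per line.

    `tokenize` is any callable that takes a message string. JSON is used
    so that the output is unambiguous whether the callable returns a
    string or a list of tokens (hypothetical convention, not part of
    the released package).
    """
    for line in in_stream:
        message = line.rstrip("\n")
        if not message:
            continue  # skip blank lines
        out_stream.write(json.dumps(tokenize(message)) + "\n")
    out_stream.flush()


if __name__ == "__main__":
    # Stand-in tokenizer (plain whitespace split) so the sketch runs on
    # its own; in real use, pass the tokenize method of a Tokenizer.
    demo_in = io.StringIO("first message\n\nsecond message\n")
    demo_out = io.StringIO()
    run_bridge(str.split, demo_in, demo_out)
    print(demo_out.getvalue(), end="")
```

With the defaults, the function reads from standard input and writes to standard output, which is the form a calling program in another language would typically use.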

We expect that, with its simple interface and ready-made tools, it can be easily integrated into your processing pipelines.

Here are two examples:

#Normal use

from sylvester.tokenizer import Tokenizer

t = Tokenizer()
tokenized_message = t.tokenize("original message")

#Processing many messages

from sylvester.tokenizer import Tokenizer

message_list = ["message 1", "message 2", "message 3"]
t = Tokenizer(workers=4)  # Quad-core machine
tokenized_message_list = t.tokenize_list(message_list)

Our original focus was on Portuguese (the third most popular language on Twitter). By providing your own examples, you can re-train the tokenizer for different languages or specific needs.

Comments, questions, suggestions or other feedback can be reported to gustavo.laboreiro at gmail.com

-- Gustavo Laboreiro
