[Corpora-List] charset identifier

Joerg Tiedemann jorg.tiedemann at lingfil.uu.se
Sat Apr 16 12:51:21 CEST 2011

Can someone point me to reliable, freely available tools for character set identification? I am looking for a fairly universal tool that reports the character encoding of a given text, given the expected language of that text (ideally with confidence values, if available).
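To make the requirement concrete, here is a minimal sketch of the kind of interface I have in mind: bytes plus an expected language go in, and a ranked list of candidate encodings with a rough score comes out. Everything in it is an assumption for illustration only: the function name `guess_encoding`, the tiny `EXPECTED_CHARS` table, and the candidate list are all hypothetical, and the score is a crude heuristic rather than a calibrated confidence value. It uses only the Python standard library and is not how any of the tools mentioned in this post work.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical table: non-ASCII characters we expect to see in each language.
# A real tool would need far richer statistics than this.
EXPECTED_CHARS: Dict[str, str] = {
    "sv": "åäöÅÄÖ",
    "de": "äöüßÄÖÜ",
}

# Hypothetical candidate list; the order decides ties.
CANDIDATES: Sequence[str] = ("utf-8", "iso-8859-1", "cp1252", "koi8-r")


def guess_encoding(
    data: bytes, lang: str, candidates: Sequence[str] = CANDIDATES
) -> List[Tuple[str, float]]:
    """Rank candidate encodings for `data`, given the expected language.

    The score is the fraction of decoded non-ASCII characters that belong
    to the expected set for `lang` (0.0 if the text is pure ASCII).
    """
    expected = set(EXPECTED_CHARS[lang])
    scores: List[Tuple[str, float]] = []
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding at all
        non_ascii = [c for c in text if ord(c) > 127]
        if non_ascii:
            score = sum(c in expected for c in non_ascii) / len(non_ascii)
        else:
            score = 0.0  # pure ASCII gives no evidence either way
        scores.append((enc, score))
    # sorted() is stable, so ties keep the order of `candidates`
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    sample = "Jörg säger hej".encode("iso-8859-1")
    for enc, score in guess_encoding(sample, "sv"):
        print(f"{enc}\t{score:.2f}")
```

Note that a heuristic like this cannot separate encodings that agree on the characters actually present (ISO-8859-1 and CP1252 tie on the sample above), which is one reason I would prefer an existing, well-tested tool.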

I know about these tools (but I would also appreciate any comments about their quality):

enca: http://gitorious.org/enca This is exactly what I need but does not support a lot of languages. Maybe someone knows how to extend it with more languages/encodings?

utrac: http://utrac.sourceforge.net/ I haven't tested it but it seems to be quite restricted as well.

cpdetector: http://cpdetector.sourceforge.net/

convert_encoding.py: https://github.com/goerz/convert_encoding.py This includes a "guess encoding" option.

These tools do not seem to be freely available: http://www.lingua-systems.com/language-identifier/lidc-application/ http://www.lingua-systems.com/unicode-converter/autouniconv-library/

The standard Unix tool 'file' is of course sometimes helpful as well, but it is too limited.

Is there anything else (that I can use without training specific models myself)?

Thanks! Jörg

-- 
**********************************************************************************
Jörg Tiedemann                      jorg.tiedemann at lingfil.uu.se
Dep. of Linguistics and Philology   http://stp.lingfil.uu.se/~joerg/
Uppsala University                  tel: +46 (0)18 - 471 1412
Box 635, SE-751 26 Uppsala/SWEDEN   fax: +46 (0)18 - 471 1094
