[Corpora-List] charset identifier

egon w. stemle egon.stemle at unitn.it
Sun Apr 17 00:13:02 CEST 2011


Joerg Tiedemann <jorg.tiedemann <at> lingfil.uu.se> writes:


>
> Can someone point me to reliable (freely available) tools for
> character set identification?
> I would like to have a rather universal tool that can give me the used
> char encoding for a given text and given the expected language of that
> text.
> (Possibly with confidence values if available.)
>
> I know about these tools (but I would also appreciate any comments
> about their quality):

...I can't say anything about their quality, sorry. Still, I'd follow the approach described in "A Composite Approach to Language/Encoding Detection" [http://www.unicode.org/iuc/iuc19/a322.html]. It is implemented in Mozilla's Universal Charset Detector [http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html , http://www.mozilla.org/projects/intl/detectorsrc.html]; the latter page has some info on how to build a standalone detector from their code.

Here are some links to projects that use the idea: [http://fredeaker.blogspot.com/2007/01/character-encoding-detection.html]. This one in particular [http://chardet.feedparser.org/] gives confidence values (and has a recent enough release, dated 2009-11-10).
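
For what it's worth, getting an encoding guess plus a confidence value out of chardet could look roughly like the sketch below. It assumes the `chardet` package as currently distributed on PyPI (third-party, not in the standard library) and its `detect` function, which returns a dict with `encoding` and `confidence` keys. The helper name `guess_encoding` and the sample sentence are made up for illustration, and the fallback to `(None, 0.0)` when the package is missing is my own choice, not something chardet does.

```python
try:
    import chardet  # third-party; assumed installed via "pip install chardet"
except ImportError:
    chardet = None  # lets the sketch run without the package, returning no guess


def guess_encoding(data: bytes):
    """Return (encoding_name_or_None, confidence between 0.0 and 1.0).

    Hypothetical helper: a thin wrapper around chardet.detect().
    """
    if chardet is None:
        return None, 0.0
    result = chardet.detect(data)
    # chardet reports encoding=None when it cannot decide
    return result["encoding"], float(result["confidence"] or 0.0)


if __name__ == "__main__":
    # Illustrative sample only: a German sentence encoded as ISO-8859-1
    sample = "Können Sie mir bitte zeigen, wo die Universität ist?".encode("iso-8859-1")
    encoding, confidence = guess_encoding(sample)
    print(encoding, confidence)
```

Note that `detect` only looks at the bytes; it has no parameter for the expected language, so that part of the original question would need extra filtering on top (for example, rejecting guesses that are implausible for the language at hand). Short inputs tend to give low confidence, so it is worth feeding it as much of the text as possible.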

Good luck!


