[Corpora-List] charset identifier

Mike Maxwell maxwell at umiacs.umd.edu
Sun Apr 17 14:56:06 CEST 2011

On 4/16/2011 6:51 AM, Joerg Tiedemann wrote:
> Can someone point me to reliable (freely available) tools for
> character set identification?
> I would like to have a rather universal tool that can give me the used
> char encoding for a given text and given the expected language of that
> text.

You might have a look at Kevin Scannell's site:

http://borel.slu.edu/crubadan/stadas.html

It's not so much about character set identification as language ID. I'm not sure what he does about character codes, although I suppose one could create multiple clusters for a single language that uses multiple encoding systems. We did something like that some years back in the TIDES Surprise Language exercise for Hindi, where there were multiple proprietary encodings on the web, and very little Unicode-encoded text.
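
To make the multiple-clusters idea concrete, here is a rough sketch in Python. It is not what Kevin's tools or the TIDES systems actually did; it only assumes that each (language, encoding) pair gets its own byte-trigram profile and that an unknown byte string is assigned to the closest profile by cosine similarity. The function names, the German sample sentence, and the use of standard Python codecs (standing in for the proprietary encodings, which Python does not ship) are all my own assumptions for illustration.

```python
from collections import Counter
import math


def byte_trigrams(data):
    """Count overlapping 3-byte sequences in a byte string."""
    return Counter(data[i:i + 3] for i in range(len(data) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_profiles(samples, encodings):
    """Build one profile ("cluster") per (language, encoding) pair.

    samples maps a language label to Unicode training text (hypothetical data).
    """
    profiles = {}
    for lang, text in samples.items():
        for enc in encodings:
            try:
                profiles[(lang, enc)] = byte_trigrams(text.encode(enc))
            except UnicodeEncodeError:
                continue  # this encoding cannot represent this language
    return profiles


def identify(data, profiles):
    """Return the (language, encoding) label whose profile is closest."""
    query = byte_trigrams(data)
    return max(profiles, key=lambda label: cosine(query, profiles[label]))


# Toy training data; a real profile would need far more text per language.
samples = {
    "de": "Die Größe der Bäume überrascht die Mädchen und Jungen häufig. "
          "Schöne Grüße aus München."
}
profiles = build_profiles(samples, ["utf-8", "latin-1"])

guess_latin1 = identify("Viele Grüße aus München".encode("latin-1"), profiles)
guess_utf8 = identify("Viele Grüße aus München".encode("utf-8"), profiles)
```

With more languages in `samples`, the same call would give language ID and encoding ID at once, since each label carries both. If the expected language is already known, as in Joerg's case, one could restrict the comparison to the profiles for that language.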

Perhaps the situation has improved since then!

--
Mike Maxwell
maxwell at umiacs.umd.edu

"My definition of an interesting universe is
one that has the capacity to study itself."
--Stephen Eastmond
