You might have a look at Kevin Scannell's site:
http://borel.slu.edu/crubadan/stadas.html
It is more about language ID than character set identification. I'm not sure how he handles character encodings, although I suppose one could create multiple clusters for a single language that is written in multiple encodings. We did something like that some years back in the TIDES Surprise Language exercise for Hindi, where there were multiple proprietary encodings on the web and very little Unicode-encoded text.
Perhaps the situation has improved since then!
--
maxwell at umiacs.umd.edu
"My definition of an interesting universe is
one that has the capacity to study itself."