[Corpora-List] charset identifier

Simon Carter s.c.carter at uva.nl
Mon Apr 18 09:59:23 CEST 2011


Along the same lines as Mike Maxwell's contribution, there is a version of TextCat that uses information about encodings for language ID: Languid (http://languid.cantbedone.org/) and http://search.cpan.org/~mceglows/Language-Guess-0.01/
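
To make the idea concrete, here is a minimal sketch of rank-based n-gram profiling (in the style of Cavnar and Trenkle) computed over raw bytes, with one profile per (language, encoding) pair. It is not the actual code of TextCat or Languid; the function names, the profile size and the toy German sample are all assumptions made for illustration.

```python
from collections import Counter

def byte_ngrams(data: bytes, n_max: int = 3) -> Counter:
    """Count all byte n-grams of length 1..n_max."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts

def profile(data: bytes, size: int = 300) -> dict:
    """Map the `size` most frequent n-grams to their rank (0 = most frequent)."""
    ranked = [gram for gram, _ in byte_ngrams(data).most_common(size)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc: dict, model: dict) -> int:
    """Sum of rank differences; n-grams absent from the model get a fixed penalty."""
    penalty = len(model)
    return sum(abs(rank - model[gram]) if gram in model else penalty
               for gram, rank in doc.items())

def train(samples: dict) -> dict:
    """`samples` maps (language, encoding) to training text (str)."""
    return {key: profile(text.encode(key[1])) for key, text in samples.items()}

def guess(data: bytes, models: dict) -> tuple:
    """Return the (language, encoding) pair whose profile is closest to `data`."""
    doc = profile(data)
    return min(models, key=lambda key: out_of_place(doc, models[key]))

# Hypothetical toy training data; real profiles need far more text.
german = ("Schöne Grüße: die Bäume über dem Fluß blühen früh, "
          "und die Vögel hören wir täglich.")
models = train({("de", "utf-8"): german, ("de", "latin-1"): german})
print(guess("Über den Bäumen hören wir die Vögel".encode("latin-1"), models))
```

Because the profiles are built from bytes rather than decoded characters, the same language in two encodings yields two distinct models, so the best match identifies language and encoding together.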

Otherwise, this page may be of help: http://odur.let.rug.nl/~vannoord/TextCat/competitors.html

Simon

On 17 Apr 2011, at 14:56, Mike Maxwell wrote:


> On 4/16/2011 6:51 AM, Joerg Tiedemann wrote:
>> Can someone point me to reliable (freely available) tools for
>> character set identification?
>> I would like to have a rather universal tool that can give me the used
>> char encoding for a given text and given the expected language of that
>> text.
>
> You might have a look at Kevin Scannell's site:
> http://borel.slu.edu/crubadan/stadas.html
> Not so much about character set identification as language ID. I'm not sure what he does about character codes, although I suppose one could create multiple clusters for a single language that uses multiple encoding systems. We did something like that some years back in the TIDES Surprise Language exercise for Hindi, where there were multiple proprietary encodings on the web, and very little Unicode-encoded text. Perhaps the situation has improved since then!
> --
> Mike Maxwell
> maxwell at umiacs.umd.edu
> "My definition of an interesting universe is
> one that has the capacity to study itself."
> --Stephen Eastmond

Simon Carter
ISLA, Informatics Institute, University of Amsterdam
Science Park 107, 1098 XG Amsterdam
Phone: +31 (0)20 525 6731
Email: s.c.carter at uva.nl
Web: www.scarter.org


