[Corpora-List] Arabic encoding guesser

Francis Tyers ftyers at prompsit.com
Tue Jul 29 16:51:33 CEST 2008


On Tue, 29-07-2008 at 10:19 -0400, David Graff wrote:
> Serge,
>
> I'd be interested in learning about any examples you've seen to the
> contrary, but for the most part, there are basically two choices for
> encoding Arabic web pages: single-byte and utf-8.

If you only need to distinguish single-byte encodings from UTF-8, the Unix utility "file" should suffice:

$ wget -q -O - http://www.bbc.co.uk/arabic | sed 's/<.*>//g' | file -
/dev/stdin: ISO-8859 text, with very long lines, with CRLF, LF line terminators

$ wget -q -O - http://ar.wikipedia.org | sed 's/<.*>//g' | file -
/dev/stdin: UTF-8 Unicode text, with very long lines

This is pretty crude but it seems to work with the few examples I've tried.
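
For anyone who would rather do this inside a script, here is a rough Python sketch of the same two-way split. It is not what "file" does internally, just the obvious approximation: try a strict UTF-8 decode, and if that fails, assume a single-byte encoding. The function name guess_encoding and the three labels it returns are my own invention for illustration, and it makes no attempt to say which single-byte encoding (windows-1256, ISO-8859-6, ...) is in use.

```python
def guess_encoding(data: bytes) -> str:
    """Classify raw bytes as 'ascii', 'utf-8', or 'single-byte'.

    Hypothetical helper: the labels are illustrative, not the
    strings printed by the "file" utility.
    """
    # Pure ASCII is valid in both UTF-8 and the common Arabic
    # single-byte encodings, so it tells us nothing either way.
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass

    # Non-ASCII bytes that form valid UTF-8 sequences are very
    # unlikely to occur by chance in single-byte Arabic text.
    try:
        data.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is single-byte.
        return "single-byte"


if __name__ == "__main__":
    sample = "\u0645\u0631\u062d\u0628\u0627"  # an Arabic word
    print(guess_encoding(sample.encode("utf-8")))
    print(guess_encoding(sample.encode("cp1256")))
```

Like the "file" approach, this is only as good as the bytes you give it: a page that is truncated in the middle of a multi-byte sequence will fail the strict decode and be reported as single-byte.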

Fran


