On Tue, 29-07-2008 at 10:19 -0400, David Graff wrote:
> Serge,
>
> I'd be interested in learning about any examples you've seen to the
> contrary, but for the most part, there are basically two choices for
> encoding Arabic web pages: single-byte and utf-8.
If you only need to distinguish between single-byte and UTF-8, the Unix utility "file" should suffice:
$ wget -q -O - http://www.bbc.co.uk/arabic | sed 's/<.*>//g' | file -
/dev/stdin: ISO-8859 text, with very long lines, with CRLF, LF line terminators
$ wget -q -O - http://ar.wikipedia.org | sed 's/<.*>//g' | file -
/dev/stdin: UTF-8 Unicode text, with very long lines
This is pretty crude but it seems to work with the few examples I've tried.
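For scripting, the same two-way decision can be sketched without parsing the output of "file". This is only a rough sketch that swaps in a different technique: it asks iconv whether the input is valid UTF-8 and treats everything else as single-byte. The function name guess_encoding is made up for illustration, and note that pure ASCII input is reported as utf-8, since ASCII is a subset of UTF-8.

```shell
# Hypothetical helper (not part of any tool): classify stdin as
# "utf-8" or "single-byte".
# Assumption: anything that is not valid UTF-8 is a single-byte encoding.
guess_encoding() {
    # iconv exits non-zero when the input contains a byte sequence
    # that is not valid UTF-8.
    if iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
        echo "utf-8"
    else
        echo "single-byte"
    fi
}
```

It would be used in the same kind of pipeline, e.g. `wget -q -O - http://ar.wikipedia.org | guess_encoding`. It does not tell you which single-byte encoding is in use, only that the page is not UTF-8.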
Fran