[Corpora-List] Grep for Unicode (was: Grep for Windows)

Tony Abou-Assaleh taa at acm.org
Sun Dec 17 16:30:01 CET 2006

GNU grep 2.5.1 is badly outdated. The CVS version has received numerous
fixes and improvements over the past two years, but somehow the
maintainers have not found time to make a formal release.

The CVS version of GNU grep [1] fixes the inefficiency in handling
Unicode and otherwise handles Unicode well. The locale is specified
through environment variables. To my knowledge, no automatic detection
of the input encoding is done. More information about Unicode support in
GNU grep can be obtained by contacting the development team [2].

[1] https://savannah.gnu.org/projects/grep/
[2] Mailing list info at http://www.gnu.org/software/grep/
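
To illustrate what "the locale is specified in environment variables"
means in practice, here is a minimal Python sketch that runs grep with
LC_ALL overridden (LC_ALL takes precedence over LC_CTYPE and LANG). The
helper names locale_env and grep_lines are hypothetical, and the sketch
assumes a grep binary on the PATH and that the named locale is installed.

```python
import os
import subprocess


def locale_env(locale_name):
    """Return a copy of the current environment with the locale overridden."""
    env = dict(os.environ)
    # LC_ALL overrides LC_CTYPE and LANG for every locale category.
    env["LC_ALL"] = locale_name
    return env


def grep_lines(pattern, data, locale_name):
    """Run grep over `data` (bytes) under the given locale.

    Returns the matching lines as a list of bytes objects.
    Assumes `grep` is on the PATH and `locale_name` is installed.
    """
    result = subprocess.run(
        ["grep", "-e", pattern],
        input=data,
        env=locale_env(locale_name),
        stdout=subprocess.PIPE,
        check=False,  # grep exits with status 1 when nothing matches
    )
    return result.stdout.splitlines()
```

For example, grep_lines("^.$", "é\n".encode("utf-8"), "en_US.UTF-8")
asks whether a line holding the two-byte UTF-8 sequence for "é" counts
as a single character. Under a UTF-8 locale it should; what happens
under the C locale depends on the grep version.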



Tony Abou-Assaleh
Email: taa at acm.org
Web site: http://tony.abou-assaleh.net
----------------------[THE END]----------------------

On Sat, 16 Dec 2006, Mike Maxwell wrote:

> Rob Malouf wrote:
> > On Dec 15, 2006, at 8:36 AM, maxwell at ldc.upenn.edu wrote:
> >> Besides, none of the standard grep implementations that I know of
> >> handle Unicode (at least not in any useful way).
> >
> > Gnu grep 2.5.1 supports Unicode, though I guess it's debatable just how
> > useful it is. The next version is supposed to be much better on that
> > front.


> I suspect this has been hashed over somewhere, and if so just point me
> in the right direction. But I don't see the string 'unicode' (upper or
> lower case) anywhere in the Gnu grep 2.5.1 that I just downloaded, save
> in the .po files (which are messages, and haven't been updated in a long
> time anyway). I did google some Red Hat info on updates to grep, which
> do speak about a Unicode issue (apparently an earlier version had an
> extreme inefficiency in the way it searched UTF-8 streams). Since I
> thought Linux distros usually came with the GNU tools, I'm a little puzzled.


> Stepping back a bit: I can think of two ways one might want to use grep
> with Unicode files.


> One is to search for a particular byte sequence, and I presume grep can
> do that.
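
A byte-sequence search of this kind can be sketched in a few lines of
Python. The function name find_byte_sequence is hypothetical; the point
is only that the search string has to be encoded the same way as the
file before the bytes can match.

```python
def find_byte_sequence(data, text, encoding="utf-8"):
    """Return the byte offsets at which `text`, encoded with `encoding`,
    occurs in `data` (a bytes object)."""
    needle = text.encode(encoding)
    offsets = []
    start = data.find(needle)
    while start != -1:
        offsets.append(start)
        # Continue one byte further so overlapping matches are reported too.
        start = data.find(needle, start + 1)
    return offsets
```

Searching UTF-8 data for "é" encoded as Latin-1 finds nothing, because
the single Latin-1 byte never appears on its own in the UTF-8 encoding
of that character.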


> The other is to search for a particular character sequence. For that,
> two things seem to be necessary: it needs to know the encoding of the
> incoming stream (UTF-8, UTF-16 big-end/little-end,...), and it needs to
> handle normalization. (And it needs to know what to do with these in
> the output.) I think the normalization issue is doable, provided the
> encoding issue is correctly handled. But there are numerous issues with
> determining the encoding of an input stream, and I'm not knowledgeable
> enough to know whether it is always possible to reliably tell from
> looking at a stream of bytes which one knows to be Unicode which
> encoding it is.
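
The two requirements can be sketched in Python under simplifying
assumptions. This is not how any grep implementation works; the names
sniff_encoding and contains_text are hypothetical. The encoding is
guessed only from a leading byte order mark (BOM), falling back to an
assumed default when there is none, and both the text and the search
string are normalized to the same form before comparison. A BOM is the
only unambiguous signal; without one, telling the encodings apart is a
matter of heuristics. Even the BOM test has an edge case: a UTF-16
little-endian file whose first character is U+0000 looks like UTF-32.

```python
import codecs
import unicodedata

# Longer BOMs are tested first: the UTF-32 LE BOM begins with the
# same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(data, default="utf-8"):
    """Guess the encoding from a leading BOM.

    Returns (encoding_name, bom_length). Without a BOM this falls back
    to `default`, which is an assumption, not a detection.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return default, 0


def contains_text(data, needle, form="NFC"):
    """Report whether `needle` occurs in `data` as a character sequence,
    comparing both sides in the same normalization form."""
    encoding, skip = sniff_encoding(data)
    text = data[skip:].decode(encoding)
    return unicodedata.normalize(form, needle) in unicodedata.normalize(form, text)
```

With this, a search for precomposed "é" (U+00E9) also finds "e" followed
by a combining acute accent (U+0301), which a plain byte search would
miss.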


> At any rate, I don't see anything that tells me how Gnu grep deals with
> Unicode encodings and normalization. Am I missing something?
> --
> Mike Maxwell
> maxwell at ldc.upenn.edu

