[Corpora-List] Encoding of apostrophes and quotes

Hardie, Andrew a.hardie at lancaster.ac.uk
Fri Jun 30 14:39:02 CEST 2006

Another use of apostrophe to add to the pile: I've often encountered two in a row used instead of a double quote.
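
As a rough illustration of that doubled-apostrophe habit, here is a minimal Python sketch that only locates such sequences in a text. The function name and the choice to also look for doubled curly quotes (U+2018, U+2019) are my own assumptions, and it deliberately does not replace anything, since two apostrophes in a row are not always meant as a double quote.

```python
import re

# Two identical apostrophe-like characters in a row: ASCII apostrophe (U+0027),
# left single quotation mark (U+2018), or right single quotation mark (U+2019).
# Including the curly variants is an assumption made for this sketch.
DOUBLED_APOSTROPHE = re.compile("''|\u2018\u2018|\u2019\u2019")


def find_doubled_apostrophes(text):
    """Return (offset, matched_string) for each doubled apostrophe in text."""
    return [(m.start(), m.group()) for m in DOUBLED_APOSTROPHE.finditer(text)]


if __name__ == "__main__":
    for offset, found in find_doubled_apostrophes("He said ''hello'' to me"):
        print(offset, repr(found))
```

Reporting offsets rather than rewriting keeps the decision about what the writer intended with whoever is cleaning the corpus.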

Other writing systems provide us with good examples of what happens when there are two Unicode characters that look identical or very similar but are supposed to be separate things: they get mixed up, both by typists and by software designers. For instance, in Urdu texts the pair alef maksura (U+0649) and farsi yeh (U+06CC) often get confused, as do farsi yeh (U+06CC) and yeh barree (U+06D2) in some positions, as do kaf (U+0643) and keheh (U+06A9). In Devanagari and similar scripts I have likewise encountered confusion between visarga (U+0903, etc.) and colon (U+003A), and between danda (U+0964) and the vertical line (U+007C). Note that these aren't even identical in appearance, just near identical, and they still get confused. So I also think the Unicode Standard is right not to demand that a much finer distinction be made with the apostrophe/single quote.
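
To make the pairs above concrete, here is a minimal Python sketch that counts how often each member of a pair occurs in a text and reports the pairs where both members turn up. The pair list is taken from the code points in this message; the function names and the "both members present" heuristic are assumptions of mine, and a mixed pair is only a hint that something may be inconsistent, not proof of an error.

```python
import unicodedata

# Look-alike pairs mentioned in the message; grouping them this way is
# only for illustration.
CONFUSABLE_PAIRS = [
    ("\u0649", "\u06CC"),  # alef maksura / farsi yeh
    ("\u06CC", "\u06D2"),  # farsi yeh / yeh barree
    ("\u0643", "\u06A9"),  # kaf / keheh
    ("\u0903", "\u003A"),  # visarga / colon
    ("\u0964", "\u007C"),  # danda / vertical line
]


def describe(ch):
    """Format a character as 'U+XXXX NAME' using the Unicode character database."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


def count_confusables(text):
    """Return {(a, b): (count_a, count_b)} for pairs where both members occur."""
    found = {}
    for a, b in CONFUSABLE_PAIRS:
        count_a, count_b = text.count(a), text.count(b)
        if count_a and count_b:
            found[(a, b)] = (count_a, count_b)
    return found


if __name__ == "__main__":
    sample = "\u0649\u06CC\u06CC"  # one alef maksura, two farsi yeh
    for (a, b), (count_a, count_b) in count_confusables(sample).items():
        print(f"{describe(a)}: {count_a}  |  {describe(b)}: {count_b}")
```

Counting rather than normalising is deliberate: which member of a pair is the intended one depends on the language and the position in the word, so the sketch leaves that judgement to the reader.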


Andrew Hardie
Department of Linguistics
Bowland College
Lancaster University
Lancaster LA1 4YT

a.hardie at lancaster.ac.uk


From: owner-corpora at lists.uib.no on behalf of Marco Baroni
Sent: Fri 30/06/2006 07:55
To: Ciarán Ó Duibhín; CORPORA at UIB.NO
Subject: Re: [Corpora-List] Encoding of apostrophes and quotes

I think that, if the people who produce the texts we parse do not make a
distinction coherently, we might as well forget about it, as it will just
create more noise (I myself have just found out now how to produce a single
quote on my keyboard -- never typed a single quote character before...)
