[Corpora-List] Encoding of apostrophes and quotes

Ron Artstein artstein at essex.ac.uk
Fri Jun 30 23:40:02 CEST 2006

> As someone who has always taken the above statements to be true,
> I have been amazed and disappointed to learn that Unicode advise
> the encoding of apostrophes and right single quotes as the same
> character (U+2019).

My understanding is that Unicode tends to unify characters that
always look the same. Since an apostrophe and a closing quote use
identical glyphs whatever the font, they get the same character;
in contrast, a comma and a baseline quote may have identical glyphs
in some fonts but distinct glyphs in other fonts, so they get
separate characters.
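
To make the code points concrete, here is a minimal Python sketch that
prints the official character names using the standard unicodedata
module. The helper name describe is made up for illustration, and
U+201A is assumed to be the "baseline quote" meant above.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# ASCII apostrophe, right single quote (also recommended for the
# apostrophe), comma, and the low "baseline" quote (assumed: U+201A).
for ch in ("\u0027", "\u2019", "\u002C", "\u201A"):
    print(describe(ch))
```

Note that there is no separate "typographic apostrophe" entry to print:
U+2019 serves both roles, while the comma and the low quote each have
their own code point.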

One thing that has always baffled me is why Unicode decided to
assign the two characters U+05F3 Hebrew punctuation geresh and
U+05F4 Hebrew punctuation gershayim. Geresh (dual: gershayim) is
the Hebrew name for a punctuation mark similar to an apostrophe
which is used for marking abbreviations; in modern usage the two
marks have glyphs identical to single and double quotes. I haven't
found an explanation of why U+05F3 and U+05F4 are distinct from
standard punctuation marks, or of whether they're intended just for
abbreviations or also for quotes.
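
For corpus work, one way to see which of these code points a Hebrew
text actually uses is simply to count them. The following is a rough
Python sketch, not an established tool: the MARKS list, the function
name mark_counts and the sample string are all assumptions chosen for
illustration.

```python
from collections import Counter
import unicodedata

# Candidate code points for apostrophe-like and quote-like marks
# (an illustrative selection, not an exhaustive list).
MARKS = ("\u0027", "\u0022", "\u2019", "\u05F3", "\u05F4")

def mark_counts(text: str) -> dict:
    """Count each candidate mark in text, keyed by its Unicode name."""
    counts = Counter(ch for ch in text if ch in MARKS)
    return {unicodedata.name(m): counts[m] for m in MARKS}

# The same made-up Hebrew word twice: once with the ASCII apostrophe
# (U+0027), once with the geresh (U+05F3).
sample = "\u05E6\u0027\u05D9\u05E4\u05E1 \u05E6\u05F3\u05D9\u05E4\u05E1"
for name, n in mark_counts(sample).items():
    print(f"{n:3d}  {name}")
```

A count like this only shows what encoders did in practice; it does not
settle which code point was intended for which function.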

My guess is that separate code points were needed because Hebrew
apostrophes and quotes are quite distinct in shape from Latin ones;
a mixed font could share code points (and glyphs) for most
punctuation marks, but using the Latin glyphs for quotes and
apostrophes in Hebrew would look very odd. If this is indeed the
rationale behind the code points U+05F3 and U+05F4, then these
characters should be used for both apostrophes and quotes in

