[Corpora-List] Encoding of apostrophes and quotes

Ciarán Ó Duibhín ciaran at oduibhin.freeserve.co.uk
Fri Jun 30 02:55:00 CEST 2006

Would list members agree with the following statements:

1. Even though they look the same, apostrophe and right single quote behave
as different characters and require different encoding.

2. An apostrophe is generally used to indicate elision or (in English)
possession: don't, 'tis, sayin', John's, James', c'est, geht's. In
tokenization, the apostrophe is not to be dropped, but is retained as part
of the token; and a token break may be considered somewhere in its vicinity.

3. A right single quote is used, in conjunction with a left single quote, to
delimit a stretch of text. In tokenization, such marks (like punctuation
in general) become separate tokens, and in many applications (such as
word-lists) they are simply dropped.
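
To make statements 2 and 3 concrete, here is a minimal Python sketch of a
word-list tokenizer. It is an illustration under stated assumptions, not a
description of any particular corpus tool: the function name tokenize and the
sample sentence are hypothetical, and the apostrophe is assumed to be encoded
as U+0027 while the single quotes are U+2018 and U+2019.

```python
import re

def tokenize(text, apostrophe="\u0027"):
    """Return word tokens, keeping `apostrophe` inside tokens.

    Every other non-word character (including quote marks that are not
    the chosen apostrophe character) acts as a separator and is dropped.
    """
    pattern = re.compile(r"[\w" + re.escape(apostrophe) + r"]+")
    return pattern.findall(text)

# Hypothetical sample: quotes are U+2018/U+2019, apostrophes are U+0027.
distinct = "\u2018Don't go,\u2019 she said. \u2018'Tis John's.\u2019"

# Same text with every apostrophe re-encoded as U+2019.
conflated = distinct.replace("\u0027", "\u2019")

# Distinct encodings: apostrophes kept, quotes dropped.
print(tokenize(distinct))
# ["Don't", 'go', 'she', 'said', "'Tis", "John's"]

# Conflated, U+2019 treated as a quote: words are split at the apostrophe.
print(tokenize(conflated))
# ['Don', 't', 'go', 'she', 'said', 'Tis', 'John', 's']

# Conflated, U+2019 treated as an apostrophe: closing quotes survive as tokens.
print(tokenize(conflated, apostrophe="\u2019"))
```

With distinct code points, one simple rule yields the intended word-list.
Once both marks share U+2019, the same rule must either split don't and
John's or let closing quotes through as stray tokens, so any repair has to
rely on context-dependent guessing.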

As someone who has always taken the above statements to be true, I have been
amazed and disappointed to learn that Unicode advise the encoding of
apostrophes and right single quotes as the same character (U+2019). Their
explanation is that people in general will find it too difficult to
understand the difference.

If I had followed this advice and used U+2019 for both apostrophe and right
single quote, all the corpus analysis which I have successfully undertaken
would have been made impossibly difficult. In fact, even the simplest text
processing exercise becomes impossible, see

I would be interested to know what people think of Unicode's advice, and how
they deal with this situation in practice.

Ciarán Ó Duibhín.

For completeness, though it doesn't affect the point above, I ought to add
that Unicode *do* make a distinction between what they call "punctuation
apostrophes" (the kind I have been talking about), and "letter apostrophes".
They assign a character (U+02BC) to the latter, to be used in cases where an
apostrophe look-alike is used to represent a sound (often, the glottal
