[Corpora-List] Encoding of apostrophes and quotes
lou.burnard at computing-services.oxford.ac.uk
Fri Jun 30 09:55:00 CEST 2006
> Would list members agree with the following statements:
> 1. Even though they look the same, apostrophe and single right quote behave
> as different characters and require different encoding.
I would say rather that the same graphic symbol has multiple
applications. There *is* a different character available for
representing "single right quote", of course, the one that looks like a
curly "smart quote".
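
As an illustrative aside (not part of the original exchange), the characters under discussion can be listed with Python's standard `unicodedata` module. The choice of the four code points below is an assumption made for illustration; U+0027 is the typewriter apostrophe, and U+02BC is the "letter apostrophe" mentioned later in the quoted message.

```python
import unicodedata

# Candidate characters that can all render as an apostrophe-like mark.
# This particular selection is an assumption made for illustration.
CANDIDATES = ["\u0027", "\u2018", "\u2019", "\u02BC"]

def describe(ch):
    """Return the code point, Unicode name and general category of ch."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} [{unicodedata.category(ch)}]"

for ch in CANDIDATES:
    print(describe(ch))
```

The general category shows the distinction Unicode itself draws: U+2019 is classed as punctuation, while U+02BC is classed as a modifier letter.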
> 2. An apostrophe is generally used to indicate elision or (in English)
> possession, as in: don't, 'tis, sayin', John's, James', c'est, geht's.
This is true, in English, certainly, but it is by no means the only use.
Consider the (infamous) use of the apostrophe to indicate plurals, for
example ("PC's"), or its use in French to indicate something about
pronunciation ("pin's"), or its use in Italian to double up for an accent.
Historically, I think, the apostrophe has the semantics of elision: we
use it in genitive forms in English because of a (possibly mistaken)
etymological assumption ("man's" standing for "mannes", e.g.).
> In tokenization, the
> apostrophe is not to be dropped, but is retained as part of the token; and a
> token break may be considered somewhere in its vicinity.
Probably. In BNC our practice is to regard things like "That's" as two
tokens, "That" and "'s", so yes, we would certainly consider the
apostrophe to be part of the second token. But others might treat this
differently.
We have exactly the same set of issues with the hyphen, of course:
a) it is sometimes used in place of the em dash
b) if "tea-pot" is treated as two tokens (rather than as a variant form
   of "teapot"), to which one does the hyphen belong?
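
A minimal sketch of the two tokenization choices just described, written in Python. It is not the BNC tokenizer: the clitic list, the function names and the `attach` parameter are all hypothetical, chosen only to make the decisions explicit.

```python
import re

# Hypothetical clitic pattern: an apostrophe (typewriter or curly) plus a
# short ending, at the end of a word. The list of endings is an assumption.
CLITIC = re.compile(r"^(\w+?)(['\u2019](?:s|re|ve|ll|d|m))$", re.IGNORECASE)

def split_clitic(word):
    """Split "That's" into "That" and "'s"; the apostrophe stays with the clitic."""
    m = CLITIC.match(word)
    return [m.group(1), m.group(2)] if m else [word]

def split_hyphen(word, attach="none"):
    """Split at the first hyphen; `attach` decides which token keeps it."""
    left, sep, right = word.partition("-")
    if not sep or not left or not right:
        return [word]  # no internal hyphen, leave the word alone
    if attach == "left":
        return [left + sep, right]
    if attach == "right":
        return [left, sep + right]
    return [left, sep, right]  # hyphen as a token of its own

print(split_clitic("That's"))
print(split_hyphen("tea-pot", attach="left"))
print(split_hyphen("tea-pot", attach="right"))
```

Forms such as "James'" or "don't" are deliberately left unsplit here; how to handle them is exactly the kind of decision on which practices differ.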
> 3. A right single quote is used, in conjunction with a left single quote, to
> delimit a stretch of text. In tokenization, such marks (like punctuation
> in general) become separate tokens, and in many applications (such as
> word-lists) they are simply dropped.
Yes, but this is a different usage of the punctuation mark -- and one
which some (partly because of the ambiguity introduced) would castigate
> As someone who has always taken the above statements to be true, I have been
> amazed and disappointed to learn that Unicode advise the encoding of
> apostrophes and right single quotes as the same character (U+2019). Their
> explanation is that people in general will find it too difficult to
> understand the difference.
Well, I am amazed and disappointed to learn that you would expect
Unicode (who or whatever you mean by that) to legislate for such usage
rules. It's no part of their brief to tell us how to use glyphs which
have a long and (dis)honourable tradition of ambiguous usage!
> If I had followed this advice and used U+2019 for both apostrophe and right
> single quote, all the corpus analysis which I have successfully undertaken
> would have been made impossibly difficult.
Indeed, but then you've converted the entirely accurate observation
that the apostrophe is often used ambiguously into a recommendation that
it should be!
I would say that the kind of usage you're talking about here (e.g. to
mark titles) ought to be carried out by proper descriptive markup. But
then I would, wouldn't I.
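
To illustrate the practical difficulty raised in the quoted message, here is a rough Python sketch of what happens when U+2019 has been used for both roles and has to be disambiguated afterwards. The rule used (a mark flanked by letters on both sides is an apostrophe, anything else is unresolved) is an assumption made for illustration, not a recommended or standard method, and `classify_u2019` is a hypothetical name.

```python
RSQUO = "\u2019"  # RIGHT SINGLE QUOTATION MARK

def classify_u2019(text):
    """Label each U+2019 in text as 'apostrophe' or 'ambiguous'.

    Heuristic (an assumption): a mark with a letter on both sides is
    word-internal and therefore an apostrophe. Word-initial and word-final
    marks cannot be decided from the neighbouring characters alone.
    """
    results = []
    for i, ch in enumerate(text):
        if ch != RSQUO:
            continue
        prev_alpha = i > 0 and text[i - 1].isalpha()
        next_alpha = i + 1 < len(text) and text[i + 1].isalpha()
        label = "apostrophe" if prev_alpha and next_alpha else "ambiguous"
        results.append((i, label))
    return results

sample = "\u2018That\u2019s the dogs\u2019 bowl\u2019"
print(classify_u2019(sample))
```

In the sample, the plural possessive "dogs’" and the closing quote after "bowl" receive the same label, which is the ambiguity that descriptive markup would remove.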
> In fact, even the simplest text
> processing exercise becomes impossible, see
> I would be interested to know what people think of Unicode's advice, and how
> they deal with this situation in practice.
> Ciarán Ó Duibhín.
> For completeness, though it doesn't affect the point above, I ought to add
> that Unicode *do* make a distinction between what they call "punctuation
> apostrophes" (the kind I have been talking about), and "letter apostrophes".
> They assign a character (U+02BC) to the latter, to be used in cases where an
> apostrophe look-alike is used to represent a sound (often, the glottal