[Corpora-List] Encoding of apostrophes and quotes

Markus Saers masaers at gmail.com
Fri Jun 30 23:27:01 CEST 2006


And taking this statement to the area of orthography: different
languages use different start-quote and end-quote characters
(as well as dashes of different lengths, etc.). This makes the problem
language-dependent, and giving a specific orthographic character
the name "start-quote" is simply wrong for most languages.
Characters should be named after how they appear visually rather than
how they are used, since the latter is language-specific.
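
For reference, the names that the Unicode character database gives to the
characters discussed in this thread can be listed with a short Python sketch.
It relies only on the standard unicodedata module; the helper name "describe"
and the choice of characters are assumptions made for illustration.

```python
import unicodedata

# Characters that come up in this thread: typewriter apostrophe, backtick,
# the two curly single quotes, and the "letter apostrophe".
CHARS = ["\u0027", "\u0060", "\u2018", "\u2019", "\u02BC"]

def describe(ch):
    """Return the code point and the official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

for ch in CHARS:
    print(describe(ch))
```

Several of these names describe a function ("quotation mark") rather than an
appearance, which is the naming practice questioned above.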

However, "disambiguation" of characters into their current function
(e.g. period as "end of sentence", "end of abbreviation" or "part of
ellipsis" in English) is a subject that is highly relevant to any
corpus analysis.
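
The same kind of functional disambiguation can be sketched for the
apostrophe-like characters discussed in this thread. The following is only a
rough Python illustration, not anyone's actual method: the function name
"classify_marks" and the four labels are hypothetical, and the heuristic looks
at nothing but the neighbouring characters.

```python
# Characters treated as "apostrophe-like": U+0027 and U+2019.
APOSTROPHE_LIKE = "'\u2019"

def classify_marks(text):
    """Return (index, label) for each apostrophe-like character in text.

    The labels are a crude first guess based only on whether the
    neighbouring characters are letters.
    """
    results = []
    for i, ch in enumerate(text):
        if ch not in APOSTROPHE_LIKE:
            continue
        prev_is_letter = i > 0 and text[i - 1].isalpha()
        next_is_letter = i + 1 < len(text) and text[i + 1].isalpha()
        if prev_is_letter and next_is_letter:
            label = "word-internal"      # don't, l'école, O'Hara
        elif prev_is_letter:
            label = "ambiguous-final"    # sayin', James', or a closing quote
        elif next_is_letter:
            label = "ambiguous-initial"  # 'tis, or an opening quote typed as '
        else:
            label = "other"
        results.append((i, label))
    return results

print(classify_marks("don't"))   # one word-internal mark
print(classify_marks("sayin'"))  # one ambiguous-final mark
```

The two "ambiguous" labels are the cases where context alone cannot tell an
elision from a quotation mark, so a real corpus tool would need
language-specific rules or markup on top of this.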

Best regards
Markus Saers

2006/6/30, Seth Grimes <grimes at altaplana.com>:

> This may not concern any of you, but for what it's worth --
>
> In certain computer-programming shells (command-line interfaces), the
> back-slanted apostrophe, `, is used to contain a command fragment for
> execution. Here's a usage example:
>
>   a=`ls -l`
>
> sets the value of the shell variable "a" to a directory listing produced
> by the command "ls -l". So if you're parsing certain texts and see a
> back-slanted apostrophe (left single quote), don't assume it starts a
> quotation that will be terminated by a forward-slanted apostrophe (right
> single quote).
>
> Seth
>
> On Fri, 30 Jun 2006, Thierry Fontenelle wrote:
>

> > I fully agree with Lou that elision is by no means the only use of the apostrophe. It's also used in Irish names like "O'Connors", "O'Hara"... Cases like "rock 'n roll" are also interesting... In French, it's indeed sometimes a marker of an elision ("l'école"), but it's also sometimes part of the token ("aujourd'hui", "prud'homme"...). We've even noticed that some people were using it to replace accents when they don't have a French keyboard (especially in instant messages: Ren'e instead of René). The decision to treat apostrophes as breaking or non-breaking characters has interesting implications for tools like spell-checkers (the same is true of hyphens, of course) and, like Marco Baroni yesterday, I'm glad to see that these crucial issues are discussed here and taken seriously... I wrote something about that on our blog a few months ago, for those of you who are interested...

> >
> > http://blogs.msdn.com/correcteurorthographiqueoffice/archive/2005/12/07/500807.aspx
> >
> > Thierry
> >
> > Thierry Fontenelle
> > Microsoft Speech & Natural Language
> >


> > -----Original Message-----
> > From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On Behalf Of Lou Burnard
> > Sent: Friday, June 30, 2006 12:52 AM
> > To: corpora at uib.no
> > Subject: Re: [Corpora-List] Encoding of apostrophes and quotes
> >
> > wrote:
> >
> > > Would list members agree with the following statements:
> > >
> > > 1. Even though they look the same, apostrophe and single right quote
> > > behave as different characters and require different encoding.
> > >
> >

> > I would say rather that the same graphic symbol has multiple applications. There *is* a different character available for representing "single right quote", of course, the one that looks like a curly "smart quote".

> >
> > > 2. An apostrophe is generally used to indicate elision or (in English)
> > > possession:
> > > don't, 'tis, sayin', John's, James', c'est, geht's.
> >
> > This is true, in English, certainly. But by no means the only use.
> >
> > Consider the (infamous) use of the apostrophe to indicate plurals for example ("PC's") or its use in French to indicate something about pronunciation ("pin's") or its use in Italian to double up for an accent ("Forli'").
> >

> > Historically, I think, the apostrophe has the semantics of elision: we use it in genitive forms in English because of a (possibly mistaken) etymological assumption (e.g. "man's" standing for "mannes").

> >
> > > In tokenization, the
> > > apostrophe is not to be dropped, but is retained as part of the token;
> > > and a token break may be considered somewhere in its vicinity.
> > >
> >

> > Probably. In BNC our practice is to regard things like "That's" as two tokens "That" and "'s" so yes, we would certainly consider the apostrophe to be part of the second token. But others might treat this differently. We have exactly the same set of issues with the hyphen, of course.

> >
> > a) it is sometimes used in place of the mdash
> > b) If "tea-pot" is treated as two tokens (rather than as a variant form of "teapot"), to which one does the hyphen belong?
> >

> > > 3. A right single quote is used, in conjunction with a left single quote, to
> > > delimit a stretch of text. In tokenization, such marks (like punctuation
> > > in general) become separate tokens, and in many applications (such as
> > > word-lists) they are simply dropped.
> > >
> > Yes, but this is a different usage of the punctuation mark -- and one
> > which some (partly because of the ambiguity introduced) would castigate
> > as mistaken!
> >

> > > As someone who has always taken the above statements to be true, I have been
> > > amazed and disappointed to learn that Unicode advise the encoding of
> > > apostrophes and right single quotes as the same character (U+2019). Their
> > > explanation is that people in general will find it too difficult to
> > > understand the difference.
> > >
> > Well, I am amazed and disappointed to learn that you would expect
> > Unicode (who or whatever you mean by that) to legislate for such usage
> > rules. It's no part of their brief to tell us how to use glyphs which
> > have a long and (dis)honourable tradition of ambiguous usage!
> >

> > > If I had followed this advice and used U+2019 for both apostrophe and right
> > > single quote, all the corpus analysis which I have successfully undertaken
> > > would have been made impossibly difficult.
> >
> > Indeed, but then you've constructed the entirely accurate observation
> > that the apostrophe is often used ambiguously into a recommendation that
> > it should be!
> >
> > I would say that the kind of usage you're talking about here (e.g. to
> > mark titles) ought to be carried out by proper descriptive markup. But
> > then I would, wouldn't I.
> >

> > > In fact, even the simplest text
> > > processing exercise becomes impossible, see
> > > http://www.smo.uhi.ac.uk/~oduibhin/apostrophe.htm .
> > >
> > > I would be interested to know what people think of Unicode's advice, and how
> > > they deal with this situation in practice.
> > >
> > > Ciarán Ó Duibhín.
> > >

> > > For completeness, though it doesn't affect the point above, I ought to add
> > > that Unicode *do* make a distinction between what they call "punctuation
> > > apostrophes" (the kind I have been talking about), and "letter apostrophes".
> > > They assign a character (U+02BC) to the latter, to be used in cases where an
> > > apostrophe look-alike is used to represent a sound (often, the glottal
> > > stop).
> > >
>
> --
> Seth Grimes Alta Plana Corp, analytical computing & data management
> Intelligent Enterprise magazine (CMP), Contributing Editor
> grimes at altaplana.com http://altaplana.com 301-270-0795
>





