[Corpora-List] Encoding of apostrophes and quotes

Mike Maxwell maxwell at ldc.upenn.edu
Fri Jun 30 14:01:00 CEST 2006


Ciarán Ó Duibhín wrote:


> 1. Even though they look the same, apostrophe and single right quote behave
> as different characters and require different encoding.


Similarly, the period character (full stop for you British types :-))
has at least the following uses in English:

1) end of declarative sentence

2) end of abbreviation

3) decimal point

4) character in ellipsis (...)

Sometimes a single period has more than one of the above functions, e.g.
when an abbreviation ends a sentence. This is very common with the
abbreviation etc.
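The overlap of functions on a single period can be sketched in code. The following is only a rough heuristic under stated assumptions, not a real sentence splitter: the function name period_functions and the tiny ABBREVIATIONS set are hypothetical, invented for this illustration, and the rules will miss many real cases.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (illustration only).
ABBREVIATIONS = {"etc", "Dr", "Mr", "Inc"}


def period_functions(text, i):
    """Guess which functions the period at text[i] serves.

    Returns a set, because one period can serve several functions at once.
    An empty set means none of these simple rules applied.
    """
    if text[i] != ".":
        raise ValueError("text[i] is not a period")

    # (4) ellipsis typed as consecutive periods
    if text[i - 1:i] == "." or text[i + 1:i + 2] == ".":
        return {"ellipsis"}

    # (3) decimal point: a digit on both sides
    if 0 < i < len(text) - 1 and text[i - 1].isdigit() and text[i + 1].isdigit():
        return {"decimal"}

    functions = set()

    # (2) abbreviation: the token just before the period is a known one
    match = re.search(r"(\S+)$", text[:i])
    if match and match.group(1) in ABBREVIATIONS:
        functions.add("abbreviation")

    # (1) sentence end: nothing follows, or whitespace then a capital letter
    rest = text[i + 1:]
    if rest.strip() == "" or re.match(r"\s+[A-Z]", rest):
        functions.add("sentence_end")

    return functions


sample = "Bring pens, paper, etc. Then wait."
etc_period = sample.index("etc.") + 3
print(period_functions(sample, etc_period))
```

For the period after "etc" in the sample, both the abbreviation rule and the sentence-end rule fire, so the returned set contains two labels for one character.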

Only (4) has a separate representation in Unicode (and some other
encodings), namely as the horizontal ellipsis, U+2026 (i.e. all three
dots as a single character).
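A minimal sketch of this point, assuming Python's standard unicodedata module: the ellipsis has its own code point, while the period is the same code point whether it ends an abbreviation, marks a decimal, or closes a sentence. The sample strings are made up for the illustration.

```python
import unicodedata

ELLIPSIS = "\u2026"  # the single-character ellipsis
PERIOD = "."

# Each character has exactly one Unicode name, whatever its function in text.
print(unicodedata.name(ELLIPSIS))
print(unicodedata.name(PERIOD))

# Compatibility normalization (NFKC) maps the ellipsis back to three periods.
three_dots = unicodedata.normalize("NFKC", ELLIPSIS)

# Abbreviation, decimal point, and sentence end all use the same code point.
samples = ["etc.", "3.14", "It rained."]
period_codes = {
    hex(ord(ch))
    for s in samples
    for ch in s
    if unicodedata.category(ch) == "Po"  # "Po" = other punctuation
}
print(three_dots, period_codes)
```

Nothing in the encoding records which function a given period has; that information has to be recovered from context.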

But I can't imagine people having to use a separate character for the
other three functions (and perhaps still another character for when the
period has more than one function).

The characters are for the benefit of the reader, not for corpus
linguists. We have to make do with whatever the readers do.

Mike Maxwell
CASL/ U MD




