[Corpora-List] Encoding of apostrophes and quotes

Marco Baroni baroni at sslmit.unibo.it
Fri Jun 30 08:57:01 CEST 2006

Hi there.

First of all, I am really glad that for once we are discussing the kind of
"low-level" processing issues that are so fundamental to getting
high-quality language data, but that are often not taken seriously as
dignified research topics...

> As someone who has always taken the above statements to be true, I have been
> amazed and disappointed to learn that Unicode advise the encoding of
> apostrophes and right single quotes as the same character (U+2019). Their
> explanation is that people in general will find it too difficult to
> understand the difference.

I think that, if the people who produce the texts we parse do not make the
distinction consistently, we might as well forget about it, as it will just
create more noise (I myself have only just found out how to produce a single
quote on my keyboard -- I had never typed a single quote character before...)

If I get a text to tokenize, unless I have a lot of reliable information
about how it was produced (which in my experience is never the case), I
just merge all single quote/apostrophe-like characters, and then use
various heuristics to decide which ones are apostrophes, which ones are
single quotes, and which ones mark an accent on the previous vowel (since
this is another way in which the apostrophe is used in electronic Italian).
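
To make the merge-then-guess idea concrete, here is a rough Python sketch.
It is only an illustration, not the code I actually use: the character set,
the function names and the three labels are all assumptions made up for
this example, and the rules are crude (an English "'tis" would be taken
for an opening quote, for instance).

```python
import re

# Assumed, non-exhaustive set of apostrophe-like characters:
# right/left single quotes, modifier apostrophe, grave and acute accents.
APOSTROPHE_LIKE = "\u2019\u2018\u02bc\u0060\u00b4"
_MERGE_TABLE = str.maketrans({ch: "'" for ch in APOSTROPHE_LIKE})


def merge_quotes(text):
    """Map every apostrophe-like character to the plain ASCII apostrophe."""
    return text.translate(_MERGE_TABLE)


def classify_quotes(text):
    """Return (position, label) pairs for each merged apostrophe.

    Labels (hypothetical): "apostrophe", "quote", "accent".
    """
    text = merge_quotes(text)
    labels = []
    quote_open = False  # True while an opening single quote is unmatched
    for match in re.finditer("'", text):
        i = match.start()
        prev = text[i - 1] if i > 0 else " "
        nxt = text[i + 1] if i + 1 < len(text) else " "
        if prev.isalpha() and nxt.isalpha():
            label = "apostrophe"      # word-internal: l'amico, don't
        elif not prev.isalpha() and nxt.isalpha():
            label = "quote"           # starts a word: opening quote
            quote_open = True
        elif prev.isalpha() and quote_open:
            label = "quote"           # ends a word inside a quotation
            quote_open = False
        elif prev.isalpha() and prev.lower() in "aeiou":
            label = "accent"          # Italian universita' for università
        elif prev.isalpha():
            label = "apostrophe"      # word-final, e.g. English dogs'
        else:
            label = "quote"           # isolated mark: fall back to quote
        labels.append((i, label))
    return labels
```

On a string such as "l’universita' e 'casa'" this labels the first mark an
apostrophe, the second an accent, and the last two quotes. Real text will
need more rules than this, and probably a word list.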

Add to that the fact that a lot of standard tools for processing Western
European text (such as the IMS TreeTagger) expect Latin-1 input, and thus
will not be able to make the distinction anyway (last time I checked, at
least...)
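
The Latin-1 limitation is easy to check in Python: U+2019 has no Latin-1
code point, so it has to be mapped to something else before the text
reaches such a tool. The helper below is a hypothetical sketch (its name
and the choice of replacing unencodable characters with "?" are my
assumptions), not part of any tagger.

```python
def to_latin1(text):
    """Encode text as Latin-1, folding curly single quotes to ASCII first.

    Any other character outside Latin-1 becomes "?" (an assumed policy).
    """
    folded = text.replace("\u2019", "'").replace("\u2018", "'")
    return folded.encode("latin-1", errors="replace")


# The right single quote itself cannot be encoded in Latin-1.
try:
    "\u2019".encode("latin-1")
    encodable = True
except UnicodeEncodeError:
    encodable = False
```

Here `encodable` ends up False, while `to_latin1("l\u2019amico")` yields
the bytes for "l'amico" with a plain apostrophe, so whatever distinction
the input made is gone at that point.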

My pessimistic 2 cents.



Marco Baroni
SSLMIT, University of Bologna

Leadership is a form of evil. No one needs to lead you to do something
that is obviously good for you.

(Scott Adams)
