[Corpora-List] JusText

maxwell maxwell at umiacs.umd.edu
Fri Oct 5 18:06:50 CEST 2018


On 2018-10-05 03:17, Alexander Osherenko wrote:
> Mark, I do know the charmap problem you are talking about from NLTK. In my
> case, there were also problems with inputs to encode containing French
> characters and I got the message "character XXX in position YYY can't be
> encoded using the ZZZ encoding". As far as I know it is a bug, but I didn't
> want to fix it.

I haven't used charmap, so I just have some general questions and suggestions. In my own experience, this kind of error usually means that Python is expecting UTF-8 but got some non-ASCII ISO-8859 characters. In this case, if it's calling CP1252.py, it sounds like it's trying to decode some text as if it were cp1252. The byte 0x81 is not assigned to any character in cp1252, but it is a valid second (continuation) byte in UTF-8: U+00C1 (the upper case A with acute), which appears in the original article, is encoded in UTF-8 as the two bytes 0xC3 0x81. So it sounds like the program is trying to interpret UTF-8 text as if it were cp1252.
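A minimal Python 3 sketch of that diagnosis, using only the built-in codecs. The helper name try_decode is made up for illustration, and I am assuming the failing text really was UTF-8 containing U+00C1; I have not seen the actual input file.

```python
# UTF-8 encodes U+00C1 as two bytes: lead byte 0xC3, continuation byte 0x81.
text = "\u00c1"
utf8_bytes = text.encode("utf-8")


def try_decode(data, encoding):
    """Hypothetical helper: return (text, None) on success, (None, error) on failure."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as err:
        return None, err


# Decoding the UTF-8 bytes as cp1252 fails on the second byte, because
# 0x81 is one of the few byte values cp1252 leaves undefined.
decoded, error = try_decode(utf8_bytes, "cp1252")
print(utf8_bytes)
print(error)

# Decoding with the encoding the bytes were actually written in succeeds.
roundtrip, _ = try_decode(utf8_bytes, "utf-8")
print(roundtrip == text)
```

If the error printed here names the same codec and byte as the one in the traceback, that would support the "UTF-8 read as cp1252" explanation, though it does not prove it.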

Which version of Python were you using? There were, as I'm sure you know, significant changes in the handling of Unicode (and other encodings) between Python 2 and 3; this sounds like it might be a 2 vs. 3 issue.
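For what it's worth, here is a rough sketch of how one might check the version and default encoding, and sidestep the default when reading a file. It assumes Python 3 and that the input is UTF-8; read_text_utf8 is a hypothetical helper, not part of NLTK or JusText.

```python
import locale
import os
import sys
import tempfile


def read_text_utf8(path):
    """Hypothetical helper: read a file as UTF-8 regardless of the locale default."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


# Which Python is running, and which encoding open() would fall back to
# when no encoding argument is given (on Windows this is often cp1252).
print(sys.version_info[:2])
print(locale.getpreferredencoding(False))

# Write a small UTF-8 file containing U+00C1, then read it back explicitly as UTF-8.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("\u00c1rea".encode("utf-8"))
    sample_path = tmp.name

sample_text = read_text_utf8(sample_path)
os.remove(sample_path)
print(sample_text == "\u00c1rea")
```

Passing the encoding explicitly means the result no longer depends on the platform's default, which is where I would expect a cp1252 decode to come from.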

Mike Maxwell

University of Maryland


