[Corpora-List] JusText

Alexander Osherenko osherenko at gmx.de
Fri Oct 5 18:55:02 CEST 2018


I doubt I would have overlooked a 2 vs. 3 issue, since I have been working on parsing malicious strings for a long time; in any case, I am using Python 3.4.

--
Alexander Osherenko, Dr. rer. nat.
Senior HCI architect
Founder and R&D Socioware Development <http://www.socioware.de/osherenko_page.html>
Profile: ResearchGate <https://www.researchgate.net/profile/Alexander_Osherenko>
Implementing Social Smart Environments with a Large Number of Believable Inhabitants in the Context of Globalization <https://www.researchgate.net/publication/327425719_Implementing_Social_Smart_Environments_with_a_Large_Number_of_Believable_Inhabitants_in_the_Context_of_Globalization> at Springer

On Fri, Oct 5, 2018 at 18:06, maxwell <maxwell at umiacs.umd.edu> wrote:


> On 2018-10-05 03:17, Alexander Osherenko wrote:
> > Mark, I do know the charmap problem you are talking about from NLTK. In
> > my case, there were also problems with inputs to encode containing
> > French characters and I got the message "character XXX in position YYY
> > can't be encoded using the ZZZ encoding". As far as I know it is a bug,
> > but I didn't want to fix it.
>
> I haven't used charmap, so I just have some general questions and
> suggestions. In my own experience, this error usually means that Python
> is expecting UTF-8 but got some non-ASCII ISO-8859 characters. In this
> case, though, if it's calling CP1252.py, then it sounds like it's trying
> to decode some text as if it were cp1252. The byte 0x81 is undefined in
> cp1252, but it is the valid second byte of the two-byte UTF-8 encoding
> (0xC3 0x81) of U+00C1 (the upper case A with acute), which appears in
> the original article. So it sounds like the program is trying to
> interpret UTF-8 text as if it were cp1252.
>
> Which version of Python were you using? There were, as I'm sure you
> know, significant changes in the handling of Unicode (and other
> encodings) between Python 2 and 3; this sounds like it might be a 2 vs.
> 3 issue.
>
> Mike Maxwell
> University of Maryland
>
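
The mismatch described in the quoted message can be reproduced in a few lines. This is only a sketch: the helper name decode_as_cp1252 is made up for illustration, and it does not show what jusText or NLTK actually do internally. It encodes U+00C1 as UTF-8 and then tries to decode the resulting bytes as cp1252, which fails on the byte 0x81.

```python
# Illustration only: UTF-8 bytes decoded with the wrong codec (cp1252).
# Written without f-strings so it also runs on Python 3.4.

text = "\u00c1"                    # U+00C1, LATIN CAPITAL LETTER A WITH ACUTE
utf8_bytes = text.encode("utf-8")  # two bytes: 0xC3 0x81


def decode_as_cp1252(data):
    """Hypothetical helper: try to decode bytes as cp1252 and report the failure."""
    try:
        return data.decode("cp1252")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        return "failed at byte 0x{:02x}, position {}".format(
            data[err.start], err.start
        )


print(decode_as_cp1252(utf8_bytes))      # reports the offending byte
print(utf8_bytes.decode("utf-8") == text)  # the right codec round-trips
```

If the bytes come from a file, one common cause is calling open() without an encoding argument on a system whose default encoding is cp1252; passing encoding="utf-8" explicitly avoids that. Whether this is what happens in the case discussed here is an assumption.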