This is probably just a Windows thing, but I figured out today that the problem is solved by converting all files to ANSI before processing them with JusText. That's obviously not going to work for all languages, and it would probably confuse things on other operating systems, but it works quite well for my English-language texts on a Windows machine.
Again, thanks for the feedback.
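A minimal sketch of the conversion step described above, assuming that "ANSI" here means the Windows-1252 code page (cp1252) and that the source files are UTF-8. The function name `convert_to_cp1252` and its parameters are hypothetical, chosen only for illustration; this is not part of JusText.

```python
def convert_to_cp1252(src, dst, src_encoding="utf-8"):
    """Re-encode a text file as cp1252 and return the bytes written.

    Assumption: src is encoded as src_encoding (UTF-8 by default).
    """
    with open(src, encoding=src_encoding) as infile:
        text = infile.read()
    # Characters with no cp1252 equivalent become "?" instead of raising
    # UnicodeEncodeError. This is lossy, so it only suits mostly-English text.
    data = text.encode("cp1252", errors="replace")
    with open(dst, "wb") as outfile:
        outfile.write(data)
    return data
```

Because of `errors="replace"`, any character outside cp1252 is silently turned into a question mark, which is why this approach is unlikely to carry over to other languages.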
============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================
________________________________
From: Alexander Osherenko <osherenko at gmx.de>
Sent: Friday, October 5, 2018 10:55 AM
To: maxwell
Cc: Mark Davies; Corpora at uib.no
Subject: Re: [Corpora-List] JusText
I doubt I would have overlooked a 2 vs. 3 issue, since I have been working on parsing malicious strings for a long time; in any case, I am using Python 3.4.
--
Alexander Osherenko, Dr. rer. nat.
Senior HCI architect
Founder and R&D Socioware Development <http://www.socioware.de/osherenko_page.html>
Profile: ResearchGate <https://www.researchgate.net/profile/Alexander_Osherenko>
Implementing Social Smart Environments with a Large Number of Believable Inhabitants in the Context of Globalization <https://www.researchgate.net/publication/327425719_Implementing_Social_Smart_Environments_with_a_Large_Number_of_Believable_Inhabitants_in_the_Context_of_Globalization> at Springer
On Fri., Oct. 5, 2018 at 18:06, maxwell <maxwell at umiacs.umd.edu> wrote:
On 2018-10-05 03:17, Alexander Osherenko wrote:
> Mark, I do know the charmap problem you are talking about from NLTK. In
> my case, there were also problems with inputs containing French
> characters, and I got the message "character XXX in position YYY can't
> be encoded using the ZZZ encoding". As far as I know it is a bug, but I
> want to fix it.
I haven't used charmap, so I just have some general questions/suggestions. In my own experience, this error usually means that Python is expecting UTF-8, but got some non-ASCII ISO-8859 characters. Or in this case, if it's calling CP1252.py, it sounds like it's trying to decode some text as if it were cp1252. 0x81 is not assigned to any character in cp1252, but it is a valid second (continuation) byte of the two-byte UTF-8 encoding of U+00C1 (the upper case A with acute, encoded as 0xC3 0x81), which appears in the original article. So it sounds like the program is trying to interpret UTF-8 text as if it were cp1252.
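A minimal sketch of the mismatch described above: the UTF-8 bytes for U+00C1 are decoded as cp1252, and decoding fails at the 0x81 byte. The helper name `first_undecodable_byte` is hypothetical and only serves the illustration.

```python
def first_undecodable_byte(data, encoding):
    """Return (position, byte value) of the first byte that fails to decode,
    or None if the whole byte string decodes cleanly."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start]
    return None


# U+00C1 (capital A with acute) encoded as UTF-8
utf8_bytes = "\u00c1".encode("utf-8")

print(utf8_bytes)                                     # b'\xc3\x81'
print(first_undecodable_byte(utf8_bytes, "cp1252"))   # (1, 129), i.e. byte 0x81
print(first_undecodable_byte(utf8_bytes, "utf-8"))    # None
```

If this is indeed the cause, passing an explicit `encoding="utf-8"` when opening the input files would avoid relying on the platform default, which on many Windows installations is cp1252.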
Which version of Python were you using? There were, as I'm sure you know, significant changes in the handling of Unicode (and other encodings) between Python 2 and 3; this sounds like it might be a 2 vs. 3 issue.
University of Maryland