On Sat, 6 Oct 2018 at 22:14, Miloš Jakubíček <milos.jakubicek at sketchengine.co.uk> wrote:
> Hi Mark,
> can I just double check two things:
> - where does this version of jusText come from (there are waaaay too many
> forks available)?
> - whom did you address your mail to, when you emailed the developers?
> The original version of jusText was designed for Python 2, not Python 3,
> so the first thing to try would be that (the default string handling has
> changed between Python 2 and Python 3).
> Milos Jakubicek
> CEO, Lexical Computing
> Brno, CZ | Brighton UK
> On Sat, 6 Oct 2018 at 05:44, Mark Davies <Mark_Davies at byu.edu> wrote:
>> Thanks for the feedback re. encoding errors with JusText.
>> This is probably just a Windows thing, but I figured out today that the
>> problem is solved by converting all files to ANSI before processing them
>> with JusText. That's obviously not going to work for all languages and it
>> would probably confuse things for other OS's, but it works quite well for
>> my English language texts on a Windows machine.
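The workaround above suggests the files were being read with the Windows default code page rather than as UTF-8. A possible alternative is sketched below, assuming the input files are UTF-8 and that the calling script controls how they are opened (the thread does not show how jusText was invoked). `read_text` is a hypothetical helper, not part of jusText.

```python
import locale
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a text file with an explicit encoding (hypothetical helper).

    Passing encoding= avoids falling back to the platform default,
    which on many Windows setups is cp1252 ("ANSI") rather than UTF-8.
    """
    with open(path, encoding=encoding, errors="strict") as handle:
        return handle.read()


# Demo: write a small UTF-8 file with a non-ASCII character, then read it back.
sample = "\u00c1rea de texto"
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write(sample)
    demo_path = tmp.name

round_trip = read_text(demo_path)
os.remove(demo_path)

# The encoding open() would use if none were given on this machine.
platform_default = locale.getpreferredencoding(False)
print("platform default:", platform_default)
print("round trip ok:", round_trip == sample)
```

If the files are not actually UTF-8, the strict error handling makes the read fail loudly instead of silently producing garbled text, which can help pin down the real encoding.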
>> Again, thanks for the feedback.
>> Mark Davies
>> Professor of Linguistics / Brigham Young University
>> ** Corpus design and use // Linguistic databases **
>> ** Historical linguistics // Language variation **
>> ** English, Spanish, and Portuguese **
>> *From:* Alexander Osherenko <osherenko at gmx.de>
>> *Sent:* Friday, October 5, 2018 10:55 AM
>> *To:* maxwell
>> *Cc:* Mark Davies; Corpora at uib.no
>> *Subject:* Re: [Corpora-List] JusText
>> I doubt I missed a 2 vs. 3 issue, since I have been working on parsing
>> malicious strings for a long time; in any case, I am using Python 3.4.
>> Alexander Osherenko, Dr. rer. nat.
>> Senior HCI architect
>> Founder and R&D
>> Socioware Development <http://www.socioware.de/osherenko_page.html>
>> Profile: ResearchGate
>> Implementing Social Smart Environments with a Large Number of Believable
>> Inhabitants in the Context of Globalization
>> <https://www.researchgate.net/publication/327425719_Implementing_Social_Smart_Environments_with_a_Large_Number_of_Believable_Inhabitants_in_the_Context_of_Globalization>
>> On Fri, 5 Oct 2018 at 18:06, maxwell <maxwell at umiacs.umd.edu> wrote:
>>> On 2018-10-05 03:17, Alexander Osherenko wrote:
>>> > Mark, I do know the charmap problem you are talking about from NLTK. In
>>> > my case, there were also problems with inputs to encode containing
>>> > French characters, and I got the message "character XXX in position YYY
>>> > can't be encoded using the ZZZ encoding". As far as I know it is a bug,
>>> > but I didn't want to fix it.
>>> I haven't used charmap, so I just have some general questions/
>>> suggestions. In my own experience, this error usually means that Python
>>> is expecting UTF-8, but got some non-ASCII ISO-8859 characters--or in
>>> this case, if it's calling CP1252.py, then it sounds like it's trying to
>>> decode some text as if it were cp1252. 0x81 is not assigned to any
>>> character in cp1252, but it is a valid second byte of the two-byte UTF-8
>>> encoding (0xC3 0x81) of U+00C1 (the upper case A with acute), which
>>> appears in the original article. So it sounds like the program is trying
>>> to interpret UTF-8 text as if it were cp1252.
>>> Which version of Python were you using? There were, as I'm sure you
>>> know, significant changes in the handling of Unicode (and other
>>> encodings) between Python 2 and 3; this sounds like it might be a 2 vs.
>>> 3 issue.
>>> Mike Maxwell
>>> University of Maryland
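The cp1252 versus UTF-8 mismatch described in the message above can be reproduced in a few lines. This is only an illustrative sketch using Python 3's built-in codecs; it does not touch jusText or NLTK, and the variable names are made up for the example.

```python
# U+00C1 (upper case A with acute) encoded as UTF-8 is two bytes: 0xC3 0x81.
data = "\u00c1".encode("utf-8")
as_hex = data.hex()

error_message = None
failing_byte = None
try:
    # Misreading the UTF-8 bytes as cp1252: 0xC3 maps to a character,
    # but 0x81 is unassigned in cp1252, so the 'charmap' codec fails.
    data.decode("cp1252")
except UnicodeDecodeError as exc:
    error_message = str(exc)
    failing_byte = data[exc.start]

# Decoding with the encoding the bytes were actually written in succeeds.
decoded_ok = data.decode("utf-8")

print("bytes:", as_hex)
print("cp1252 error:", error_message)
print("utf-8 result:", decoded_ok.encode("unicode_escape").decode("ascii"))
```

The failure happens on the second byte, which is why the error message names 0x81 rather than 0xC3.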