[Corpora-List] JusText

Miloš Jakubíček milos.jakubicek at sketchengine.co.uk
Sat Oct 6 22:14:10 CEST 2018


Hi Mark,

can I just double check two things:

- where does this version of jusText come from (there are waaaay too many forks available)?
- whom did you address your mail to, when you emailed the developers?

The original version of jusText was designed for Python 2, not Python 3, so the first thing to try would be running it under Python 2 (the default string handling changed between Python 2 and Python 3).
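
To illustrate the kind of failure this can produce, here is a minimal sketch, assuming the input is UTF-8 and the platform default encoding is cp1252 (as is common on Windows). The sample string and the helper name read_utf8 are made up for illustration; this is not jusText's own code.

```python
import os
import tempfile

# Made-up sample text containing U+00C1 (upper case A with acute).
sample = "\u00c1lvaro"
utf8_bytes = sample.encode("utf-8")  # two bytes for U+00C1: 0xC3 0x81

# Decoding the UTF-8 bytes as cp1252 fails on byte 0x81,
# which Python's cp1252 codec leaves undefined.
try:
    utf8_bytes.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc

# Hypothetical helper: read a file with an explicit encoding instead of
# relying on the platform default that Python 3's open() would pick.
def read_utf8(path):
    with open(path, encoding="utf-8") as handle:
        return handle.read()

# Round trip through a temporary file to show the explicit-encoding read.
fd, tmp_path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as handle:
    handle.write(utf8_bytes)
text = read_utf8(tmp_path)
os.remove(tmp_path)
```

If the files really are UTF-8, reading them this way and passing the resulting string on should avoid the need to convert them to ANSI first.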

Best
Milos

Milos Jakubicek

CEO, Lexical Computing
Brno, CZ | Brighton UK
http://www.lexicalcomputing.com
http://www.sketchengine.co.uk

On Sat, 6 Oct 2018 at 05:44, Mark Davies <Mark_Davies at byu.edu> wrote:


> Thanks for the feedback re. encoding errors with JusText.
>
>
> This is probably just a Windows thing, but I figured out today that the
> problem is solved by converting all files to ANSI before processing them
> with JusText. That's obviously not going to work for all languages and it
> would probably confuse things for other OS's, but it works quite well for
> my English language texts on a Windows machine.
>
>
> Again, thanks for the feedback.
>
>
> Mark Davies
>
>
> ============================================
> Mark Davies
> Professor of Linguistics / Brigham Young University
> http://davies-linguistics.byu.edu/
>
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
> ============================================
> ------------------------------
> *From:* Alexander Osherenko <osherenko at gmx.de>
> *Sent:* Friday, October 5, 2018 10:55 AM
> *To:* maxwell
> *Cc:* Mark Davies; Corpora at uib.no
> *Subject:* Re: [Corpora-List] JusText
>
> I doubt I would have missed the 2 vs. 3 issue, since I have been working on
> parsing malicious strings for a long time, but in any case I am using Python 3.4.
>
> --
> Alexander Osherenko, Dr. rer. nat.
> Senior HCI architect
> Founder and R&D
> Socioware Development <http://www.socioware.de/osherenko_page.html>
> Profile: ResearchGate
> <https://www.researchgate.net/profile/Alexander_Osherenko>
> Implementing Social Smart Environments with a Large Number of Believable
> Inhabitants in the Context of Globalization
> <https://www.researchgate.net/publication/327425719_Implementing_Social_Smart_Environments_with_a_Large_Number_of_Believable_Inhabitants_in_the_Context_of_Globalization> at
> Springer
>
>
> On Fri, 5 Oct 2018 at 18:06, maxwell <maxwell at umiacs.umd.edu> wrote:
>
>> On 2018-10-05 03:17, Alexander Osherenko wrote:
>> > Mark, I do know the charmap problem you are talking about from NLTK. In
>> > my
>> > case, there were also problems with inputs to encode containing French
>> > characters and I got the message "character XXX in position YYY can't
>> > be
>> > encoded using the ZZZ encoding". As far as I know it is a bug, but I
>> > didn't
>> > want to fix it.
>>
>> I haven't used charmap, so I just have some general questions/
>> suggestions. In my own experience, this error usually means that Python
>> is expecting UTF-8, but got some non-ASCII ISO-8859 characters--or in
>> this case, if it's calling CP1252.py, then it sounds like it's trying to
>> decode some text as if it were cp1252. 0x81 is not assigned to any character
>> in cp1252, but it is the second byte of the two-byte UTF-8 encoding (0xC3 0x81)
>> of U+00C1 (the upper case A with acute), which appears in the original
>> article. So it sounds like the program is trying to interpret UTF-8
>> text as if it were cp1252.
>>
>> Which version of Python were you using? There were, as I'm sure you
>> know, significant changes in the handling of Unicode (and other
>> encodings) between Python 2 and 3; this sounds like it might be a 2 vs.
>> 3 issue.
>>
>> Mike Maxwell
>> University of Maryland
>>