[Corpora-List] JusText

Miloš Jakubíček milos.jakubicek at sketchengine.co.uk
Mon Oct 8 11:14:42 CEST 2018


Hi Mark,

you may want to try the "original" jusText from http://corpus.tools/ and use it with Python 2. Let me know if you face any problems with that -- that is something we could look into.

Best
Milos

Milos Jakubicek

CEO, Lexical Computing
Brno, CZ | Brighton UK
http://www.lexicalcomputing.com
http://www.sketchengine.co.uk

On Mon, 8 Oct 2018 at 02:40, Mark Davies <Mark_Davies at byu.edu> wrote:


> Milos,
>
> >> - where does this version of jusText come from (there are waaaay too many
> >> forks available)?
>
> https://pypi.org/project/jusText/ (I think; it was 3-4 years ago)
>
>
> >> - whom did you address your mail to, when you emailed the developers?
>
> Miso Belica (and others) about 2 years ago; see for example:
>
> https://github.com/miso-belica/jusText/issues/20
>
> Best,
>
> Mark Davies
>
>
> ============================================
> Mark Davies
> Professor of Linguistics / Brigham Young University
> http://davies-linguistics.byu.edu/
>
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
> ============================================
> ------------------------------
> *From:* Miloš Jakubíček <milos.jakubicek at sketchengine.co.uk>
> *Sent:* Saturday, October 6, 2018 2:14 PM
> *To:* Mark Davies
> *Cc:* osherenko at gmx.de; maxwell at umiacs.umd.edu; Corpora list
> *Subject:* Re: [Corpora-List] JusText
>
> Hi Mark,
>
> can I just double check two things:
>
> - where does this version of jusText come from (there are waaaay too many
> forks available)?
> - whom did you address your mail to, when you emailed the developers?
>
> The original version of jusText was designed for Python 2, not Python 3,
> so the first thing to try would be that (the default string handling has
> changed between Python 2 and Python 3).
>
> Best
> Milos
>
>
> Milos Jakubicek
>
> CEO, Lexical Computing
> Brno, CZ | Brighton UK
> http://www.lexicalcomputing.com
> http://www.sketchengine.co.uk
>
>
> On Sat, 6 Oct 2018 at 05:44, Mark Davies <Mark_Davies at byu.edu> wrote:
>
>> Thanks for the feedback re. encoding errors with JusText.
>>
>>
>> This is probably just a Windows thing, but I figured out today that the
>> problem is solved by converting all files to ANSI before processing them
>> with JusText. That's obviously not going to work for all languages and it
>> would probably confuse things for other OS's, but it works quite well for
>> my English language texts on a Windows machine.
>>
>>
>> Again, thanks for the feedback.
>>
>>
>> Mark Davies
>>
>>
>> ============================================
>> Mark Davies
>> Professor of Linguistics / Brigham Young University
>> http://davies-linguistics.byu.edu/
>>
>> ** Corpus design and use // Linguistic databases **
>> ** Historical linguistics // Language variation **
>> ** English, Spanish, and Portuguese **
>> ============================================
>> ------------------------------
>> *From:* Alexander Osherenko <osherenko at gmx.de>
>> *Sent:* Friday, October 5, 2018 10:55 AM
>> *To:* maxwell
>> *Cc:* Mark Davies; Corpora at uib.no
>> *Subject:* Re: [Corpora-List] JusText
>>
>> I doubt I would have missed a 2 vs. 3 issue, since I have been working on
>> parsing malicious strings for a long time, but in any case I am using Python 3.4.
>>
>> --
>> Alexander Osherenko, Dr. rer. nat.
>> Senior HCI architect
>> Founder and R&D
>> Socioware Development <http://www.socioware.de/osherenko_page.html>
>> Profile: ResearchGate
>> <https://www.researchgate.net/profile/Alexander_Osherenko>
>> Implementing Social Smart Environments with a Large Number of Believable
>> Inhabitants in the Context of Globalization
>> <https://www.researchgate.net/publication/327425719_Implementing_Social_Smart_Environments_with_a_Large_Number_of_Believable_Inhabitants_in_the_Context_of_Globalization> at
>> Springer
>>
>>
>> On Fri, 5 Oct 2018 at 18:06, maxwell <maxwell at umiacs.umd.edu> wrote:
>>
>>> On 2018-10-05 03:17, Alexander Osherenko wrote:
>>> > Mark, I do know the charmap problem you are talking about from NLTK. In my
>>> > case, there were also problems with inputs to encode containing French
>>> > characters and I got the message "character XXX in position YYY can't be
>>> > encoded using the ZZZ encoding". As far as I know it is a bug, but I didn't
>>> > want to fix it.
>>>
>>> I haven't used charmap, so I just have some general questions/
>>> suggestions. In my own experience, this error usually means that Python
>>> is expecting UTF-8, but got some non-ASCII ISO-8859 characters--or in
>>> this case, if it's calling CP1252.py, then it sounds like it's trying to
>>> decode some text as if it were cp1252. The byte 0x81 is not assigned
>>> in cp1252, but it is the valid second byte of the two-byte UTF-8 encoding
>>> (0xC3 0x81) of U+00C1 (the upper case A with acute), which appears in the
>>> original article. So it sounds like the program is trying to interpret UTF-8
>>> text as if it were cp1252.
>>>
>>> Which version of Python were you using? There were, as I'm sure you
>>> know, significant changes in the handling of Unicode (and other
>>> encodings) between Python 2 and 3; this sounds like it might be a 2 vs.
>>> 3 issue.
>>>
>>> Mike Maxwell
>>> University of Maryland
>>>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> https://mailman.uib.no/listinfo/corpora
>>
>
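
For readers following the thread, the mismatch Mike Maxwell describes can be reproduced with the Python 3 standard library alone. This is a minimal sketch, not code from jusText: the helper name decode_bytes and its fallback order are assumptions made for illustration.

```python
# Illustration only (not from jusText): reproduce the cp1252 vs. UTF-8 mismatch.

# U+00C1 (upper case A with acute) is two bytes in UTF-8: 0xC3 0x81.
raw = "\u00c1".encode("utf-8")

# Byte 0x81 is unassigned in Python's cp1252 codec, so decoding fails there.
try:
    raw.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc  # exc.start is the offset of the offending byte


def decode_bytes(raw_bytes, encodings=("utf-8", "cp1252")):
    """Hypothetical helper: try each encoding in order, return (text, name)."""
    for name in encodings:
        try:
            return raw_bytes.decode(name), name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte value, so this last resort never raises,
    # although it may yield mojibake for text in another encoding.
    return raw_bytes.decode("latin-1"), "latin-1"


if __name__ == "__main__":
    print(cp1252_error)
    print(decode_bytes(raw))
    print(decode_bytes(b"caf\xe9"))
```

On Windows, Python 3 typically falls back to the locale encoding (often cp1252) when a file is opened without an explicit encoding, which would be consistent with the behaviour reported above. Passing encoding="utf-8" to open() is likely a more portable workaround than converting the input files to ANSI.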