[Corpora-List] JusText

Alexander Osherenko osherenko at gmx.de
Fri Oct 5 09:17:44 CEST 2018


Mark, I know the charmap problem you describe from NLTK. In my case, encoding inputs that contained French characters also failed, with the message "character XXX in position YYY can't be encoded using the ZZZ encoding". As far as I know it is a bug, but I didn't want to fix it.

Hence, I debugged the Python code in NLTK and found that the line "input_.encode(encoding)" throws the encoding exception. It is in the _execute function, which prepares the input for the Java call that parses it.
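The same kind of failure can be reproduced outside NLTK. The following is only a minimal sketch: the names input_ and encoding mirror the line quoted above, but the sample string and the choice of "ascii" as the target encoding are assumptions made for illustration.

```python
# Hypothetical input containing French characters ("déjà vu").
input_ = "d\u00e9j\u00e0 vu"
# Assumed target encoding that cannot represent these characters.
encoding = "ascii"

failed_at = None
try:
    input_.encode(encoding)
except UnicodeEncodeError as e:
    # e.start is the index of the first character that cannot be encoded.
    failed_at = e.start
    print("cannot encode %r at position %d using %s"
          % (input_[e.start], e.start, e.encoding))
```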

I used three solutions to the problem. The simplest was to replace the offending characters, which eliminates the source of the encoding problem. Another was to use a different encoding, or to detect it automatically using the chardet package (this worked in some cases). The third was simply to ignore the Unicode exception. Here is the Python code for solutions two and three:

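    import sys
    import chardet

    # Fragment from inside a parser method; `parse` is assumed to hold raw bytes.
    # Encodings are tried in order; "detect" asks chardet to guess.
    # Note: "iso-8859-1" and "latin-1" are the same codec and accept any byte,
    # so the final "ascii" entry is never reached.
    possibleEncodings = ["utf-8", "detect", "iso-8859-1", "latin-1", "ascii"]
    for encoding in possibleEncodings:
        sys.stderr.write("processing encoding '%s'\n" % encoding)
        if encoding == "detect":
            result = chardet.detect(parse)
            encoding = result['encoding']
            sys.stderr.write("Detected encoding %s...\n" % encoding)
            if encoding is None:
                # chardet could not guess; try the next candidate
                continue
        try:
            #sys.stderr.write("pas: %s\n" % parse)
            parse = self._parse_trees_output(str(parse, encoding))
            break
        except UnicodeDecodeError as e:
            sys.stderr.write("UnicodeDecodeError: encoding %s...\n" % encoding)
            continue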
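For the first solution (replacing characters), and as a variant of the third that never raises at all, something like the following could work. This is only a sketch: the helper names strip_accents and decode_leniently are hypothetical, and stripping accents through Unicode normalization is one possible way to replace characters, not necessarily the replacement I used. Characters without a decomposition (for example "œ") are left unchanged by it.

```python
import unicodedata


def strip_accents(text):
    """Replace accented characters with their unaccented base letters."""
    # NFKD splits e.g. "é" into "e" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep everything else.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def decode_leniently(raw, encoding="utf-8"):
    """Decode bytes without raising; undecodable bytes become U+FFFD."""
    return raw.decode(encoding, errors="replace")


plain = strip_accents("d\u00e9j\u00e0 vu")   # "déjà vu" without its accents
lenient = decode_leniently(b"caf\xe9")       # Latin-1 bytes read as UTF-8
print(plain)
print(repr(lenient))
```

Passing errors="ignore" instead of errors="replace" drops the undecodable bytes silently, which hides how much of the input was lost.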

It is, of course, not the best solution, but I could live with that.

Alexander

-- Alexander Osherenko, Dr. rer. nat. Senior HCI architect

Founder and R&D Socioware Development <http://www.socioware.de/osherenko_page.html>

Humboldt Innovation Humboldt-Universität zu Berlin

Profile: ResearchGate <https://www.researchgate.net/profile/Alexander_Osherenko> Implementing Social Smart Environments with a Large Number of Believable Inhabitants in the Context of Globalization <https://www.researchgate.net/publication/327425719_Implementing_Social_Smart_Environments_with_a_Large_Number_of_Believable_Inhabitants_in_the_Context_of_Globalization> at Springer

On Fri, 5 Oct 2018 at 06:47, Mark Davies <Mark_Davies at byu.edu> wrote:


> Sorry to send a “bug report” to CORPORA, but I’m guessing that there are a
> number of people here who use JusText (for boilerplate removal). I’ve
> emailed the developers, but haven’t received a reply.
>
> I have JusText (version 2.1 on some machines, 2.2 on others) installed on
> several Windows machines (Server 2012, Server 2012 R2, Windows 10), and I’m
> having a problem with JusText crashing on about 40% of all files, due to
> encoding issues. The error message I get is:
>
> File "c:\python32\lib\encodings\cp1252.py", line 23, in decode return
> codecs.charmap_decode(input,self.errors,decoding_table)[0]
> UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position
> …..: character maps to <undefined>
>
> Just one example of a page that is causing it to crash:
>
> http://www.democracynow.org/2012/7/6/peru_declares_state_of_emergency_as
>
> I've tried every possible combination of
>
> --encoding=...
> --enc-force
> --enc-errors=...
>
> as well as every possible encoding on the files, and it's still crashing
> on about 40% of all files.
>
> Again, sorry to post this to CORPORA, but hopefully someone might have
> some suggestions.
>
> Thanks,
>
> Mark Davies
>
> ============================================
> Mark Davies
> Professor of Linguistics / Brigham Young University
> http://davies-linguistics.byu.edu/
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
> ============================================


