[Corpora-List] JusText

Mark Davies Mark_Davies at byu.edu
Fri Oct 5 06:41:22 CEST 2018


Sorry to send a “bug report” to CORPORA, but I’m guessing that there are a number of people here who use JusText (for boilerplate removal). I’ve emailed the developers, but haven’t received a reply.

I have JusText (version 2.1 on some machines, 2.2 on others) installed on several Windows machines (Server 2012, Server 2012 R2, Windows 10), and I’m having a problem with JusText crashing on about 40% of all files, due to encoding issues. The error message I get is:

  File "c:\python32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position …..: character maps to <undefined>

Just one example of a page that is causing it to crash:

http://www.democracynow.org/2012/7/6/peru_declares_state_of_emergency_as

I've tried every possible combination of

--encoding=...

--enc-force

--enc-errors=...

as well as every possible encoding on the files, and it's still crashing on about 40% of all files.
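
One plausible reading of the traceback (an assumption, not something confirmed by the JusText developers) is that UTF-8 bytes are being decoded with the Windows default codec, cp1252, which has no mapping for byte 0x81. The sketch below reproduces that error with the standard library alone and shows one possible workaround: decode the raw bytes explicitly before handing the text to any other tool. The helper name decode_html and the candidate encoding list are hypothetical and are not part of JusText.

```python
# Sketch only: reproduces the 'charmap' error and shows one way around it.
# decode_html is a hypothetical helper, not part of JusText.

# "\u00c1" (capital A with acute) is encoded in UTF-8 as the bytes C3 81.
raw = "\u00c1ngel".encode("utf-8")

# Decoding those bytes as cp1252 fails, because 0x81 is undefined there.
try:
    raw.decode("cp1252")
    failed = False
    message = ""
except UnicodeDecodeError as exc:
    failed = True
    message = str(exc)


def decode_html(raw_bytes, candidates=("utf-8", "cp1252")):
    """Try each candidate encoding strictly, in order.

    If none of them decodes the input cleanly, fall back to UTF-8 and
    substitute U+FFFD for undecodable bytes instead of raising.
    """
    for encoding in candidates:
        try:
            return raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw_bytes.decode("utf-8", errors="replace")


text = decode_html(raw)
print(failed, message)
print(text)
```

If this is indeed the cause, reading each file in binary mode (open(path, "rb")) and decoding it this way would at least keep a single bad byte from aborting the whole run.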

Again, sorry to post this to CORPORA, but hopefully someone might have some suggestions.

Thanks,

Mark Davies

============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================
