[Corpora-List] Convert Arabic PDF to txt for corpus use

Karin Verspoor karin.verspoor at unimelb.edu.au
Sat Apr 11 07:03:05 CEST 2020

Another tool not yet mentioned is Apache Tika: https://tika.apache.org/

Worth a try.

Caveat: I haven’t tried it for Arabic texts.
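
For anyone who wants to try it, here is a minimal sketch of calling Tika's standalone app from Python. It assumes Java is installed and that the app jar has been downloaded from the Tika site; the jar name `tika-app.jar` and both function names are placeholders of mine, not anything Tika prescribes. The same caveat applies: this is untested on Arabic PDFs.

```python
import subprocess
from pathlib import Path


def tika_text_command(pdf_path, jar_path="tika-app.jar"):
    """Build the command line for the Tika app in plain-text mode.

    jar_path is a placeholder: point it at wherever the downloaded
    tika-app jar actually lives (the real file name includes a version).
    """
    return [
        "java", "-jar", str(jar_path),
        "--encoding=UTF-8",  # request UTF-8 output, which Arabic text needs
        "--text",            # plain-text output rather than the default XHTML
        str(pdf_path),
    ]


def pdf_to_txt(pdf_path, out_path, jar_path="tika-app.jar"):
    """Run Tika on one PDF and save whatever text it extracts."""
    result = subprocess.run(
        tika_text_command(pdf_path, jar_path),
        capture_output=True,
        check=True,  # raise if Java or Tika exits with an error
    )
    # Tika writes the extracted text to stdout; store the bytes unchanged.
    Path(out_path).write_bytes(result.stdout)
```

Calling `pdf_to_txt("sample.pdf", "sample.txt")` on a single file and checking the result by eye would show quickly whether the output is usable before running a whole collection.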


On 10/4/20, 8:17 pm, Mai Zaki <maizaki at gmail.com> (via corpora-bounces at uib.no) wrote:

Hi all,

Can anyone recommend a reliable and accurate way to convert Arabic PDF files to text files (on a Mac) for use with corpus analysis software? I tried the free trial versions of several programs on a sample, but they all failed miserably. I even tried the PDF converter package in RStudio; it worked with one file but not the other, and even when it worked the output still needed quite a lot of editing. I am willing to pay for professional software, but I need to be sure that it actually works.

Any advice is deeply appreciated.

Thanks,
Mai Zaki

