[Corpora-List] Convert Arabic PDF to txt for corpus use

Serge Heiden slh at ens-lyon.fr
Fri Apr 10 12:47:42 CEST 2020

Hi Mai,

The PDF format is designed for printing, not for managing words and their characters (it excels at font management and rendering). I don't know of any software that can reliably convert arbitrary PDFs to text, whether in French, English or Arabic. Generally, I start with pdftotext on Linux, and if that doesn't work I fall back to OCR.
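
A minimal sketch of that first step in Python, assuming pdftotext (from Poppler / Xpdf) is installed and on the PATH. The function names, the thresholds and the "share of Arabic letters" check are my own illustrative assumptions, not part of any tool. The check is crude: it only flags output that is empty or mostly non-Arabic, and it will not notice letters extracted in the wrong order.

```python
import shutil
import subprocess
import unicodedata


def arabic_ratio(text: str) -> float:
    """Share of letters that lie in the basic Arabic block (U+0600-U+06FF)."""
    letters = [ch for ch in text if unicodedata.category(ch).startswith("L")]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06ff")
    return arabic / len(letters)


def looks_usable(text: str, min_letters: int = 200, min_ratio: float = 0.5) -> bool:
    """Hypothetical heuristic: enough letters, and most of them Arabic."""
    letters = sum(1 for ch in text if unicodedata.category(ch).startswith("L"))
    return letters >= min_letters and arabic_ratio(text) >= min_ratio


def extract_with_pdftotext(pdf_path: str) -> str:
    """Run pdftotext and return its output as a Python string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    # "-enc UTF-8" sets the output encoding; "-" sends the text to stdout.
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def convert(pdf_path: str) -> tuple[str, bool]:
    """Return (text, needs_ocr); the OCR step itself is not shown here."""
    text = extract_with_pdftotext(pdf_path)
    return text, not looks_usable(text)
```

Calling convert("sample.pdf") (a placeholder file name) returns the extracted text and a flag; when the flag is true, the file is a candidate for the OCR route instead.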

Best, Serge

On 10/04/2020 at 12:09, Mai Zaki wrote:
> Hi all,
> Can anyone recommend a reliable and accurate way to convert PDF files in Arabic to text files (on a Mac) for use with corpus analysis software? I tried several programs (free
> trial versions) on a sample to see whether they work, but they all failed miserably. I even tried the PDF converter package in RStudio; it worked with one file but not the other, and when it worked the output
> still needed quite a lot of editing. I am willing to pay for professional software, but I need to be sure that it actually works.
> Any advice is deeply appreciated.
> Thanks,
> Mai Zaki

-- Dr. Serge Heiden, slh at ens-lyon.fr, http://textometrie.ens-lyon.fr ENS de Lyon - IHRIM UMR5317 15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883
