[Corpora-List] Convert Arabic PDF to txt for corpus use

reham marzouk marzoukreham at gmail.com
Fri Apr 10 13:31:55 CEST 2020


Hi Mai, I hope you are well. I prefer to use OCR too. Just convert the PDF files into images and run an OCR engine such as Tesseract on them. It is free, and its recent versions perform well on Arabic script.
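For what it is worth, here is a minimal Python sketch of the workflow discussed in this thread: try pdftotext first, and fall back to page images plus Tesseract when the extracted text does not look like Arabic. It assumes the poppler command-line tools (pdftotext, pdftoppm) and Tesseract with its Arabic language data (`ara`) are installed and on the PATH. The function names, the 300 dpi setting, and the "share of Arabic characters" check with its 0.5 threshold are my own illustrative assumptions, not something tested on your files.

```python
import subprocess
import tempfile
from pathlib import Path


def arabic_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Arabic block U+0600-U+06FF."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum("\u0600" <= c <= "\u06ff" for c in chars) / len(chars)


def rasterize_cmd(pdf: Path, prefix: Path, dpi: int = 300) -> list[str]:
    """pdftoppm call that writes one PNG per page as <prefix>-<page>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def tesseract_cmd(image: Path, out_base: Path) -> list[str]:
    """tesseract call that writes <out_base>.txt using the Arabic model."""
    return ["tesseract", str(image), str(out_base), "-l", "ara"]


def ocr_pdf(pdf: Path) -> str:
    """Rasterise every page, OCR each image, and join the page texts."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(rasterize_cmd(pdf, prefix), check=True)
        pages = []
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        for image in sorted(Path(tmp).glob("page-*.png")):
            out_base = image.with_suffix("")
            subprocess.run(tesseract_cmd(image, out_base), check=True)
            pages.append(out_base.with_suffix(".txt").read_text(encoding="utf-8"))
        return "\n".join(pages)


def pdf_to_text(pdf: Path, min_ratio: float = 0.5) -> str:
    """Try pdftotext first; fall back to OCR if the output looks wrong.

    min_ratio is an arbitrary, hypothetical threshold for "looks like Arabic".
    """
    result = subprocess.run(
        ["pdftotext", str(pdf), "-"],  # "-" sends the text to stdout
        capture_output=True,
        encoding="utf-8",
        errors="replace",
    )
    if result.returncode == 0 and arabic_ratio(result.stdout) >= min_ratio:
        return result.stdout
    return ocr_pdf(pdf)
```

Calling `pdf_to_text(Path("sample.pdf"))` on one of your files (the file name here is only a placeholder) would return the text as a string that you can write out as UTF-8. The ratio check is deliberately crude: a document with many digits, Latin words or punctuation may need a lower threshold, and either route can still need manual proofreading.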

Best, Reham

On Friday, 10 April 2020 at 12:47 PM, Serge Heiden <slh at ens-lyon.fr> wrote:


> Hi Mai,
>
> The PDF format is designed for printing, not for managing words and
> their characters (it excels at font management and rendering).
> I don't know of any software that can reliably convert any PDF to text in
> French, English or Arabic.
> Generally, I start with pdftotext on Linux and if it doesn't work I go to
> OCR.
>
> Best,
> Serge
> On 10/04/2020 at 12:09, Mai Zaki wrote:
>
> Hi all,
> Can anyone recommend a reliable and accurate way to convert PDF
> files in Arabic to text files (on a Mac) for use with corpus analysis
> software? I tried several programs (free trial versions) on a
> sample to see if they work, but they all failed miserably. I even tried the
> PDF converter package in RStudio; it worked with one file but not the
> other, and when it worked the output still needed quite some editing. I am
> willing to pay for professional software, but I need to be sure that
> it actually works.
> Any advice is deeply appreciated.
> Thanks,
> Mai Zaki
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> https://mailman.uib.no/listinfo/corpora
>
> --
> Dr. Serge Heiden, slh at ens-lyon.fr, http://textometrie.ens-lyon.fr
> ENS de Lyon - IHRIM UMR5317
> 15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883