[Corpora-List] Convert Arabic PDF to txt for corpus use

subscribe-1 at rambler.ru subscribe-1 at rambler.ru
Fri Apr 10 13:30:29 CEST 2020


Dear Mai Zaki,

I hope you are asking about PDFs that contain a text layer, not scanned images of text.

I have worked a lot with PDF and can confirm what Serge says. PDF is designed for printing, so the text is stored as chunks with coordinates. Moreover, each program that produces PDF does this differently, so depending on the file you may extract whole lines of text, detached words (without spaces), or even single characters, which have to be joined together to reconstruct the original text.

The `pdftotext` command [1] is part of Poppler and should be available on Mac [2]. It converts PDF into plain text, but I have never been lucky enough to get a correct result with it.

In my experience, the most reliable method is to (1) convert the PDF to XML, and then (2) write a script to join the chunks.

1) Use `pdftohtml` [3]:


> pdftohtml -xml input.pdf output.xml

2) This step requires some programming knowledge, and the resulting script is not universal. You need to find the horizontal shift that corresponds to a space between words, and to separate the columns of text (if there are any).
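
As an illustration of step 2, here is a minimal Python sketch, not a universal solution. It assumes the XML written by `pdftohtml -xml` contains `<page>` elements with `<text top=... left=... width=...>` children. The function name `join_chunks` and the thresholds `space_gap` and `line_tol` are hypothetical and have to be tuned for each document. The `rtl` flag only orders the chunks from right to left, which is an assumption for Arabic; whether the characters inside each chunk are already in reading order depends on the PDF. Column separation is not handled.

```python
import xml.etree.ElementTree as ET


def join_chunks(xml_text, space_gap=1.0, line_tol=2.0, rtl=False):
    """Rebuild text lines from the chunks in `pdftohtml -xml` output.

    space_gap: horizontal gap above which a space is inserted (assumed value).
    line_tol:  maximum difference in `top` for chunks on one line (assumed value).
    rtl:       order chunks right to left (assumption for Arabic text).
    """
    root = ET.fromstring(xml_text)
    out = []
    for page in root.iter("page"):
        chunks = []
        for el in page.iter("text"):
            # itertext() also collects text nested in <b>, <i> or <a>.
            text = "".join(el.itertext())
            if not text.strip():
                continue
            chunks.append((float(el.get("top")), float(el.get("left")),
                           float(el.get("width")), text))

        # Group chunks into lines by their vertical position.
        chunks.sort(key=lambda c: c[0])
        lines = []
        for chunk in chunks:
            if lines and abs(chunk[0] - lines[-1][0][0]) <= line_tol:
                lines[-1].append(chunk)
            else:
                lines.append([chunk])

        # Within each line, order chunks horizontally and join them.
        for line in lines:
            line.sort(key=lambda c: c[1], reverse=rtl)
            parts = [line[0][3]]
            for prev, cur in zip(line, line[1:]):
                if rtl:
                    gap = prev[1] - (cur[1] + cur[2])
                else:
                    gap = cur[1] - (prev[1] + prev[2])
                if gap > space_gap:
                    parts.append(" ")
                parts.append(cur[3])
            out.append("".join(parts))
    return "\n".join(out)
```

It could be applied to a converted file with something like `join_chunks(open("output.xml", encoding="utf-8").read(), rtl=True)`. A sensible value for `space_gap` can be found by printing the gaps for a few lines and checking where the word boundaries fall.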

[1]: https://manpages.debian.org/experimental/poppler-utils/pdftotext.1.en.html
[2]: http://macappstore.org/pdftotext/
[3]: https://manpages.debian.org/experimental/poppler-utils/pdftohtml.1.en.html

Best regards, Nikita

On Fri, 10 Apr 2020 12:47:42 +0200 Serge Heiden <slh at ens-lyon.fr> wrote:


> Hi Mai,
>
> The PDF formats are designed for printing, not for managing words and
> their characters (it excels in font management and rendering). I
> don't know any software able to convert reliably any PDF to text in
> French, English or Arabic. Generally, I start with pdftotext on Linux
> and if it doesn't work I go to OCR.
>
> Best,
> Serge
>
> Le 10/04/2020 à 12:09, Mai Zaki a écrit :
> > Hi all,
> > Can anyone recommend a reliable and accurate way to
> > convert PDF files in Arabic to text files (for Mac) to be used by
> > corpus analysis software. I tried several types of software (the
> > free trial version) with a sample to see if it works but they all
> > failed miserably, I even tried the PDF converter package on R
> > Studio, it worked with one file but not the other, and when it
> > worked it still needed quite some editing. I am willing to pay to
> > get a professional software to do it but I need to be sure that it
> > actually works. Any advice is deeply appreciated. Thanks, Mai Zaki


