[Corpora-List] Convert Arabic PDF to txt for corpus use

Martin Weisser martin at martinweisser.org
Fri Apr 10 13:59:58 CEST 2020


Hi Mai,

Since you said you are willing to pay for software, you might want to try Acrobat Pro with the save-as-text option. In my experience, this produces the best extraction results, although I don't know how well it handles Arabic. Acrobat Pro also has batch-mode facilities that you can configure to run the extraction on any number of files.

Cheers,
Martin

========================
Professor Martin Weisser
Yunshan Outstanding Scholar
Center for Linguistics & Applied Linguistics
Guangdong University of Foreign Studies
http://martinweisser.org

On Fri, Apr 10, 2020 at 6:16 PM +0800, "Mai Zaki" <maizaki at gmail.com> wrote:

Hi all,

Can anyone recommend a reliable and accurate way to convert PDF files in Arabic to text files (on a Mac) for use with corpus analysis software? I tried several programs (free trial versions) on a sample to see whether they work, but they all failed miserably. I even tried the PDF converter package in RStudio; it worked with one file but not the other, and when it did work the output still needed quite a lot of editing. I am willing to pay for professional software, but I need to be sure that it actually works.

Any advice is deeply appreciated.

Thanks,
Mai Zaki



