[Corpora-List] Corpora Digest, Vol 154, Issue 12

Edward Jahn ejahn3141 at gmail.com
Sat Apr 11 14:55:44 CEST 2020


ABBYY Finereader will do it. https://promo.abbyy.com/finereader-overview-us.html?gclid=EAIaIQobChMI-riZhLXg6AIVEJSzCh1nDAdmEAAYASAAEgIM6_D_BwE

It also handles Chinese, Japanese, and Korean, other right-to-left scripts besides Arabic, and more. I have had excellent results with multiple languages. It is not free, but I have tried some of the free alternatives, and none of them work well.

On Sat, Apr 11, 2020 at 6:00 AM <corpora-request at uib.no> wrote:


> Today's Topics:
>
> 1. Covert Arabic PDF to txt for corpus use (Mai Zaki)
> 2. Re: Covert Arabic PDF to txt for corpus use (Serge Heiden)
> 3. Re: Covert Arabic PDF to txt for corpus use (Eric Atwell)
> 4. Re: Covert Arabic PDF to txt for corpus use (reham marzouk)
> 5. Re: Covert Arabic PDF to txt for corpus use (Martin Weisser)
> 6. Call for participation: QuarantinePass project
> (Nihal Yağmur AYDIN)
> 7. Re: Covert Arabic PDF to txt for corpus use (Karin Verspoor)
> 8. Re: Covert Arabic PDF to txt for corpus use
> (subscribe-1 at rambler.ru)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 10 Apr 2020 14:09:08 +0400
> From: Mai Zaki <maizaki at gmail.com>
> Subject: [Corpora-List] Covert Arabic PDF to txt for corpus use
> To: corpora at uib.no
>
> Hi all,
> Can anyone recommend a reliable and accurate way to convert Arabic PDF
> files to text files (on a Mac) for use with corpus analysis software? I
> tried the free trial versions of several programs on a sample, but they all
> failed miserably. I even tried the PDF converter package in RStudio; it
> worked with one file but not the other, and even when it worked the output
> still needed considerable editing. I am willing to pay for professional
> software, but I need to be sure that it actually works.
> Any advice is deeply appreciated.
> Thanks,
> Mai Zaki
>
> ------------------------------
>
> Message: 2
> Date: Fri, 10 Apr 2020 12:47:42 +0200
> From: Serge Heiden <slh at ens-lyon.fr>
> Subject: Re: [Corpora-List] Covert Arabic PDF to txt for corpus use
> To: corpora at uib.no
>
> Hi Mai,
>
> The PDF format is designed for printing, not for managing words and
> their characters (it excels at font management and rendering).
> I don't know of any software that can reliably convert arbitrary PDFs to
> text, whether in French, English or Arabic.
> Generally, I start with pdftotext on Linux, and if that doesn't work I
> fall back to OCR.
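
A minimal sketch of that "pdftotext first, OCR as a fallback" workflow, in Python. The Arabic-character ratio and its 0.5 threshold are my own assumptions for deciding that extraction "didn't work", and `fallback` stands for whatever OCR routine you have available. Only the `pdftotext` call itself (with `-enc UTF-8`, and `-` to write to stdout) comes from Poppler.

```python
import subprocess


def arabic_ratio(text):
    """Share of non-whitespace characters in the basic Arabic block (U+0600-U+06FF)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum("\u0600" <= ch <= "\u06ff" for ch in chars) / len(chars)


def pdftotext(pdf_path):
    """Run Poppler's pdftotext and return the extracted text."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def extract(pdf_path, primary=pdftotext, fallback=None, min_ratio=0.5):
    """Try direct extraction first; use the fallback (e.g. OCR) if the text looks wrong."""
    text = primary(pdf_path)
    # Assumption: usable Arabic output consists mostly of Arabic-block characters.
    if arabic_ratio(text) >= min_ratio or fallback is None:
        return text
    return fallback(pdf_path)
```

Usage would be something like `extract("book.pdf", fallback=my_ocr)`, where `my_ocr` is a hypothetical function you supply. The threshold needs lowering for documents that mix Arabic with Latin text or many digits.
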
>
> Best,
> Serge
>
>
> --
> Dr. Serge Heiden, slh at ens-lyon.fr, http://textometrie.ens-lyon.fr
> ENS de Lyon - IHRIM UMR5317
> 15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883
>
>
> ------------------------------
>
> Message: 3
> Date: Fri, 10 Apr 2020 11:37:17 +0000
> From: Eric Atwell <E.S.Atwell at leeds.ac.uk>
> Subject: Re: [Corpora-List] Covert Arabic PDF to txt for corpus use
> To: Mai Zaki <maizaki at gmail.com>, "corpora at uib.no" <corpora at uib.no>
>
> Hello Mai,
>
> An alternative solution may be to search the web (or ask on forums) for a
> text transcript of your Arabic PDF book or document. For example,
> https://iqsaweb.wordpress.com/2016/08/22/online-corpora-of-classical-arabic-texts/
> has links to several Arabic text sites:
> http://shamela.ws/
> http://www.alwaraq.net/
> http://lib.eshia.ir/
> https://www.noorlib.ir/
> http://shiaonlinelibrary.com/
>
> These are mainly Classical Arabic books, so this approach may not help if
> you need modern Arabic documents.
>
> I hope you are safe and well
>
> Eric
>
> Eric Atwell, Professor of Artificial Intelligence for Language
> PhD tutor, School of Computing, Uni of LEEDS, LS2 9JT, UK
> http://www.comp.leeds.ac.uk/eric
> http://qurananalysis.com http://corpus.quran.com
>
>
>
> ------------------------------
>
> Message: 4
> Date: Fri, 10 Apr 2020 13:31:55 +0200
> From: reham marzouk <marzoukreham at gmail.com>
> Subject: Re: [Corpora-List] Covert Arabic PDF to txt for corpus use
> To: Serge Heiden <slh at ens-lyon.fr>
> Cc: corpora at uib.no
>
> Hi Mai,
> I hope you are well. I prefer to use OCR too. Just convert the PDF
> files into images and run an OCR engine such as Tesseract on them. It is
> free, and its recent versions perform well on Arabic characters.
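
As a rough sketch of that pipeline in Python: Poppler's `pdftoppm` renders each page to a PNG, and the `tesseract` command-line tool reads each image with the Arabic model (`-l ara`). It assumes both programs and the Arabic language data are installed. The 300 dpi setting and the function names are illustrative choices, not recommendations from this thread.

```python
import subprocess
import tempfile
from pathlib import Path


def pdftoppm_cmd(pdf_path, prefix, dpi=300):
    """Command that renders every page of the PDF to <prefix>-<n>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(prefix)]


def tesseract_cmd(image_path, lang="ara"):
    """Command that OCRs one image and writes the recognised text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def page_number(image_path):
    """Numeric page suffix of a file such as page-07.png."""
    return int(Path(image_path).stem.rsplit("-", 1)[1])


def ocr_pdf(pdf_path, lang="ara", dpi=300):
    """OCR a whole PDF page by page and return the concatenated text."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(pdftoppm_cmd(pdf_path, prefix, dpi), check=True)
        pages = []
        # Sort by the numeric suffix so pages stay in document order.
        for image in sorted(Path(tmp).glob("page-*.png"), key=page_number):
            result = subprocess.run(
                tesseract_cmd(image, lang), capture_output=True, check=True
            )
            pages.append(result.stdout.decode("utf-8", errors="replace"))
    return "\n".join(pages)
```

Calling `ocr_pdf("sample.pdf")` on a few pages first is a cheap way to judge the quality before committing to a whole corpus.
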
>
> Best,
> Reham
>
>
> ------------------------------
>
> Message: 5
> Date: Fri, 10 Apr 2020 11:59:58 +0000 (UTC)
> From: Martin Weisser <martin at martinweisser.org>
> Subject: Re: [Corpora-List] Covert Arabic PDF to txt for corpus use
> To: "Corpora at uib.no" <Corpora at uib.no>, Mai Zaki <maizaki at gmail.com>
>
> Hi Mai,
>
>
> As you said you can pay for software, you might want to try Acrobat Pro
> with the save-as-text option. In my experience, this produces the best
> results for extraction, even though I don't know if it handles Arabic well.
> Acrobat Pro also has facilities for doing batch-mode jobs you can configure
> to run the extraction on any number of files.
>
> Cheers,
> Martin
> ========================
> Professor Martin Weisser
> Yunshan Outstanding Scholar
> Center for Linguistics & Applied Linguistics
> Guangdong University of Foreign Studies
> http://martinweisser.org
>
>
> ------------------------------
>
> Message: 6
> Date: Fri, 10 Apr 2020 14:11:40 +0200 (CEST)
> From: Nihal Yağmur AYDIN <nihal-yagmur.aydin at unicaen.fr>
> Subject: [Corpora-List] Call for participation: QuarantinePass project
> To: corpora <corpora at uib.no>
>
> Dear all,
>
> We joined the "Global Hackathon" by teaming up with some members of the
> Marie Curie Alumni Association, and I submitted my project yesterday:
> https://devpost.com/software/helping-people-who-are-in-quarantine
>
> The project is about helping people who are in quarantine with various
> forms of support: mental, emotional and physical. The physical aspect is
> about showing exercises, whereas the emotional and mental parts are
> currently under discussion. Moreover, there will be a multilingual website
> for our implementation, and we need linguists or native speakers of various
> languages to help our project grow. We also need people who can give us
> feedback, especially for validating the project, so it would be good to
> take part if you are quarantined yourself.
>
> Please fill out this survey if you would like to contribute:
> https://www.surveymonkey.com/r/B3ZSPH7
>
> Please also note that you have to be available tomorrow to contribute,
> as the project period ends this Sunday (UTC).
>
> Thanks for your cooperation!
>
>
> Best Regards,
>
> Nihal Yağmur AYDIN
>
> Research Engineer
> GREYC-CNRS-ENSICAEN
> Université de Caen Normandie
> Normandie Université
>
>
>
> ------------------------------
>
> Message: 7
> Date: Sat, 11 Apr 2020 05:03:05 +0000
> From: Karin Verspoor <karin.verspoor at unimelb.edu.au>
> Subject: Re: [Corpora-List] Covert Arabic PDF to txt for corpus use
> To: Mai Zaki <maizaki at gmail.com>, "corpora at uib.no" <corpora at uib.no>
>
> Another tool not yet mentioned is Apache Tika: https://tika.apache.org/
>
> Worth a try.
>
> Caveat: I haven't tried it for Arabic texts.
>
> Karin
>
>
> ------------------------------
>
> Message: 8
> Date: Fri, 10 Apr 2020 14:30:29 +0300
> From: <subscribe-1 at rambler.ru>
> Subject: Re: [Corpora-List] Covert Arabic PDF to txt for corpus use
> To: corpora at uib.no
>
> Dear Mai Zaki,
>
> I hope you are asking about PDFs that contain text, not images of text.
>
> I have worked a lot with PDFs and can confirm what Serge said. PDF is
> designed for printing, so the text is split into chunks with coordinates.
> Moreover, each program does this differently, so you may extract whole
> lines of text, detached words (without spaces), or even single characters,
> which must be joined together to reconstruct the original text.
>
> The command `pdftotext` [1] is part of the Poppler utilities and should be
> available on Mac [2]. It converts PDF into plain text, but I have never
> been lucky enough to get a correct result from it.
>
> In my experience, the most reliable method is to (1) convert the PDF to
> XML and then (2) write a script to join the chunks.
>
> 1) Use `pdftohtml` [3]:
>
> > pdftohtml -xml input.pdf output.xml
>
> 2) This step requires programming knowledge, and the resulting script
> is not universal. You need to find the horizontal offset that corresponds
> to the space between words, and to separate the columns of text (if needed).
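
Step 2 could look roughly like this in Python. It reads the XML that `pdftohtml -xml` writes, where each `<page>` holds `<text>` chunks carrying `top`, `left` and `width` attributes, groups chunks into lines by vertical position, and inserts a space wherever the horizontal gap exceeds a threshold. The function name, the tolerances `line_tol` and `gap`, and the `rtl` switch are assumptions to tune per document. Column separation is not handled, and whether Arabic chunks need right-to-left ordering depends on how a given PDF stores them.

```python
import xml.etree.ElementTree as ET


def join_chunks(xml_text, line_tol=2.0, gap=3.0, rtl=False):
    """Rebuild lines of text from the chunks in `pdftohtml -xml` output."""
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter("page"):
        chunks = []
        for node in page.iter("text"):
            content = "".join(node.itertext())  # keeps text nested in <b>/<i>
            if not content.strip():
                continue
            top = float(node.get("top"))
            left = float(node.get("left"))
            right = left + float(node.get("width"))
            chunks.append((top, left, right, content))

        # Group chunks whose vertical position is (almost) the same into lines.
        chunks.sort(key=lambda c: c[0])
        lines = []
        for chunk in chunks:
            if lines and abs(chunk[0] - lines[-1][0][0]) <= line_tol:
                lines[-1].append(chunk)
            else:
                lines.append([chunk])

        rebuilt = []
        for line in lines:
            # Reading order: left to right, or right to left when rtl=True.
            line.sort(key=lambda c: c[1], reverse=rtl)
            parts = [line[0][3]]
            for prev, cur in zip(line, line[1:]):
                space = prev[1] - cur[2] if rtl else cur[1] - prev[2]
                if space > gap:  # a wide gap is treated as a word boundary
                    parts.append(" ")
                parts.append(cur[3])
            rebuilt.append("".join(parts))
        pages.append("\n".join(rebuilt))
    return "\n\n".join(pages)
```

With the file from step 1, this would be called as `join_chunks(open("output.xml", encoding="utf-8").read(), rtl=True)`. Printing a page or two with different `gap` values is the quickest way to find the offset that matches a real word space.
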
>
> [1]:
> https://manpages.debian.org/experimental/poppler-utils/pdftotext.1.en.html
> [2]: http://macappstore.org/pdftotext/
> [3]:
> https://manpages.debian.org/experimental/poppler-utils/pdftohtml.1.en.html
>
> Best regards,
> Nikita
>
>
> ----------------------------------------------------------------------
> Send Corpora mailing list submissions to
> corpora at uib.no
>
> To subscribe or unsubscribe via the World Wide Web, visit
> https://mailman.uib.no/listinfo/corpora
> or, via email, send a message with subject or body 'help' to
> corpora-request at uib.no
>
> You can reach the person managing the list at
> corpora-owner at uib.no
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Corpora digest..."
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> https://mailman.uib.no/listinfo/corpora
>
>
> End of Corpora Digest, Vol 154, Issue 12
> ****************************************
>

-- Ed


