[Corpora-List] English corpus of OCRed texts

Muthu muthu.chandra at comp.nus.edu.sg
Thu Apr 27 23:59:23 CEST 2017

Hi Katya

If you are okay with scientific documents (academic papers), then our ACL Anthology Reference Corpus version 2 might fit your needs.

You can download and use it for free from here: http://acl-arc.comp.nus.edu.sg

The OCR results are available as XML output from Nuance OmniPage OCR. There is no hand-extracted ground truth, but one can always validate the output against the original PDF documents, which are also available for download on our website.

Please let us know should you have any queries related to our corpus.

Cheers! Muthu

Muthu Kumar Chandrasekaran
Ph.D. Candidate | Web Information Retrieval / Natural Language Processing Group (WING)
School of Computing | National University of Singapore (NUS)
My Homepage <http://www.comp.nus.edu.sg/~a0092669/>

On 28 April 2017 at 00:12, Katsiaryna Stalpouskaya <katerina.sto at gmail.com> wrote:

> Hi all,
> Do you know whether there exists a freely available corpus of English
> OCRed texts and ground truth for them? Thanks in advance!
> Best regards,
> Katya