[Corpora-List] pdfs/ OCR question

John F. Sowa sowa at bestweb.net
Tue Dec 12 04:33:00 CET 2006

That depends on how the PDF was created:

> interesting to know that pdf files store text info separately!

Some PDF files are generated by scanning each page of a book or
article into an image format (GIF or TIFF, for example). In such
a PDF file, there are no character strings internally, and some
kind of OCR is necessary to convert each image into character
strings. OCR is error-prone: it might misread the image of "the"
as the character string "die".

But if the PDF file was generated from a document in some
textual format, such as HTML, LaTeX, TXT, ODT, or DOC, the PDF
file preserves the original text strings internally. If
you copy and paste text from a PDF of that kind into an editor
for some other kind of text, such as OpenOffice or MS Word, you
will get a copy of the original character string, but some or
all of the formatting info may be lost. That process would
never convert "the" into "die".
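
As a rough illustration of the distinction, here is a minimal sketch
in Python that flags the pages likely to need OCR. It assumes the text
of each page has already been pulled out by some PDF library of your
choice; the function name pages_needing_ocr and the min_chars threshold
are hypothetical, chosen only for this example.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return the 0-based indexes of pages with (almost) no extracted text.

    page_texts: one string (or None) per page, as returned by whatever
    PDF text extractor is in use (an assumption of this sketch).
    min_chars: hypothetical cutoff; pages with fewer non-whitespace
    characters are treated as scanned images.
    """
    flagged = []
    for index, text in enumerate(page_texts):
        # Drop all whitespace so that blank lines do not count as text.
        visible = "".join((text or "").split())
        if len(visible) < min_chars:
            flagged.append(index)
    return flagged
```

A page that comes back empty is probably a scanned image. A page with
text may still be a scan to which someone has added an OCR text layer,
so this check cannot rule out errors of the "the"/"die" kind.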

There are some caveats, however. Some PDF files may have
special characters for ligatures, such as fi, fl, ff, etc.
Even though the ligatures are represented in character strings,
a copy & paste from such files to another editor may convert
the ligature to an unrecognized character. (Some OCR systems
also have difficulty with ligatures because the letters "f"
and "i" or "l" are too close together for easy recognition.)
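
If the pasted text does contain the Unicode ligature characters, they
can be expanded back into ordinary letters. The following is a minimal
sketch in Python; the function name expand_ligatures is hypothetical,
and the table covers only the five Latin f-ligatures (U+FB00 to
U+FB04). It will not help when the PDF maps its ligature glyphs to
some private or unmapped code, which is the "unrecognized character"
case described above.

```python
# Unicode presentation-form ligatures and their plain-letter spellings.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# Translation table for str.translate (one code point -> replacement string).
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Replace f-ligature characters in text with their component letters."""
    return text.translate(_LIGATURE_TABLE)
```

For example, expand_ligatures("\ufb01le") gives "file". Unicode NFKC
normalization (unicodedata.normalize in Python) also expands these
characters, but it changes many other compatibility characters as
well, so an explicit table is the more conservative choice.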

John Sowa
