[Corpora-List] converting PDFs to ASCII or text-only files without clumps

Christian Chiarcos christian.chiarcos at web.de
Wed Jun 16 14:21:28 CEST 2010


Sorry for the confusion: the *more* in my mail was an artifact, and no comparison with Tika was intended. It referred to the original first line of my mail, which mentioned ps2ascii. I removed that line because ps2ascii is not really an option, either for special characters or for the clumps problem.

Christian


> *Comment off list*
>
> FYI: Tika provides an XHTML representation of the input. Just for my own
> interest, could you explain why you think it is a more suitable option?
>
> Thanks


