[Corpora-List] Tools for historical languages?

Martin Reynaert reynaert at uvt.nl
Wed Nov 19 18:40:35 CET 2008

Dear Stefanie,

I am working on a tool that performs spelling normalization for large corpora, contemporary or historical, as part of a project for the National Library of the Netherlands.

The tool is called TICCL (pronounced 'tickle'), short for Text-Induced Corpus Clean-up. The prototype was described in:

Martin Reynaert. Non-interactive OCR post-correction for giga-scale digitization projects. In A. Gelbukh (Ed.), Proceedings of the Computational Linguistics and Intelligent Text Processing 9th International Conference, CICLing 2008. Lecture Notes in Computer Science Vol. 4919/2008, Berlin / Heidelberg: Springer, pp. 617-630.

The focus there was on OCR misrecognition errors, but TICCL handles any kind of spelling variation. It is largely language-independent, though it does assume an alphabetic writing system.

A production-grade version should become available as free software early next year. I intend to announce the release on this list.


Martin Reynaert
ILK (Induction of Linguistic Knowledge)
TiCC (Tilburg centre for Creative Computing)
University of Tilburg


Stefanie Dipper wrote:
> Dear all,
> I'm looking for tools for the analysis of historical languages, e.g.
> sentence splitters, part-of-speech taggers, or spelling normalisers. I am
> working on German texts (diplomatic transcriptions) from the 11th-16th
> centuries, but I'd be interested in tools for any historical language, and
> tools for languages that lack a standardised spelling such as dialects.
> Thank you for any help,
> Stefanie
> --
> Jun.-Prof. Dr. Stefanie Dipper
> Sprachwiss. Institut, Ruhr-Universitaet Bochum
> D - 44780 Bochum, Germany
> http://www.linguistics.ruhr-uni-bochum.de/~dipper