We have spent the past few years developing VARD 2, a tool for normalising spelling in historical corpora (particularly Early Modern English). It can be used to standardise individual texts or an entire corpus, either manually or automatically. Variants are replaced with their modern equivalents, and XML tags are used to retain the original spelling. The tool also learns which replacement methods are most effective, so training it on a relatively small sample will improve standardisation of a particular corpus.
The tool was developed for Early Modern English; however, by plugging in other dictionaries and through training, it can be used with other languages and varieties.
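To give a rough idea of what replacing a variant while retaining the original spelling in an XML tag can look like, here is a minimal Python sketch. It is an illustration only, not VARD 2's code: the variant table, the function name `normalise`, and the tag and attribute names (`normalised`, `orig`) are all assumptions made for this example, and VARD 2's actual output format and replacement methods may differ.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical variant -> modern spelling table (illustrative only;
# a real tool would use dictionaries and learned replacement methods).
VARIANTS = {"vertue": "virtue", "loue": "love", "doe": "do"}


def normalise(text, variants=VARIANTS):
    """Replace known variants with modern forms, keeping the original
    spelling in an XML attribute. Tag and attribute names are assumed."""
    pieces = []
    # Splitting on a capturing group keeps both words and the text between them.
    for token in re.split(r"([A-Za-z]+)", text):
        modern = variants.get(token.lower())
        if modern is None:
            # Not a known variant: pass through, escaped for XML.
            pieces.append(escape(token))
            continue
        if token[0].isupper():
            # Preserve an initial capital from the original spelling.
            modern = modern.capitalize()
        pieces.append(
            "<normalised orig=%s>%s</normalised>" % (quoteattr(token), escape(modern))
        )
    return "".join(pieces)


print(normalise("Loue and vertue"))
```

Because the original form is kept in the attribute, the normalisation can be reversed or inspected later, which is the point of tagging rather than overwriting.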
Further details of our research are available at http://ucrel.lancs.ac.uk/VariantSpelling/ and the tool itself is free for academic use. It can be downloaded, along with further details and a user guide, from http://www.comp.lancs.ac.uk/~barona/vard2/
We've also recently completed studies investigating the effect of spelling variation on corpus linguistic techniques:
For keyword analysis: Baron, A., Rayson, P. and Archer, D. (forthcoming). Word frequency and key word statistics in historical corpus linguistics. International Journal of English Studies.
And for part-of-speech tagging: Rayson, P., Archer, D., Baron, A., Culpeper, J. and Smith, N. (2007). Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. In Davies, M., Rayson, P., Hunston, S. and Danielsson, P. (eds.) Proceedings of the Corpus Linguistics Conference: CL2007, University of Birmingham, UK, 27-30 July 2007.
Both studies quantify the effect of spelling variation on corpus linguistic studies. The former paper also quantifies the levels of spelling variation in various Early Modern English corpora including Early English Books Online.
Please get in touch if you require more details.
Regards,
Alistair Baron

________________________
Alistair Baron
C28, Computing Department, Infolab 21, South Drive,
Lancaster University, Lancaster, LA1 4WA
T: +44(0) 15245 10348
E: a.baron at comp.lancs.ac.uk

-----Original Message-----
From: Stefanie Dipper <dipper at linguistics.rub.de>
Date: 2008/11/19
Subject: [Corpora-List] Tools for historical languages?
To: CORPORA at uib.no
I'm looking for tools for the analysis of historical languages, e.g. sentence splitters, part-of-speech taggers, or spelling normalisers. I am working on German texts (diplomatic transcriptions) from the 11th-16th centuries, but I'd be interested in tools for any historical language, and tools for languages that lack a standardised spelling such as dialects.
Thank you for any help,
Stefanie

--
Jun.-Prof. Dr. Stefanie Dipper
Sprachwiss. Institut, Ruhr-Universitaet Bochum
D - 44780 Bochum, Germany
http://www.linguistics.ruhr-uni-bochum.de/~dipper
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora