I'm doing some work on evaluating algorithms for segmenting individual words in a text corpus, and then tagging each word-segment with a part-of-speech tag and multiple morphological features (e.g. lemma, person, gender, number, etc). I’m looking for tagged corpora for morphologically rich languages. Right now, the algorithms I'm looking to develop/evaluate would be for Arabic, but as part of the research, it would be great to see how such algorithms perform on other morphologically rich languages (e.g. there are many European languages which are considered to be morphologically rich).
For training and testing of statistical algorithms, I’m looking for *free* corpora, available for download and offline analysis, that have been segmented into morpheme groups and have had each segment tagged. Any pointers to resources of this nature would be much appreciated. Please note that I’m not after lexicons, dictionaries, analyzers or untagged text. I’m looking for annotated textual data: sentences with morphological segmentation and tagging.
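To make the kind of annotation I mean more concrete, here is a minimal sketch in Python of one segmented and tagged word. The class names, the tag labels and the English example word are all assumptions made up for illustration; they are not taken from any particular corpus or tagset.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Segment:
    """One morphological segment of a word (hypothetical structure)."""
    form: str   # surface form of the segment
    pos: str    # part-of-speech tag assigned to this segment
    features: Dict[str, str] = field(default_factory=dict)  # e.g. lemma, person


@dataclass
class Word:
    """A corpus token split into tagged segments (hypothetical structure)."""
    surface: str
    segments: List[Segment]

    def boundaries(self) -> List[int]:
        """Return the character offsets at which the word is split."""
        offsets: List[int] = []
        position = 0
        # The end of the last segment is the end of the word, not a boundary.
        for segment in self.segments[:-1]:
            position += len(segment.form)
            offsets.append(position)
        return offsets


# Illustrative English example; the tag names are invented for this sketch.
example = Word(
    surface="unbreakable",
    segments=[
        Segment("un", "PREFIX"),
        Segment("break", "VERB", {"lemma": "break"}),
        Segment("able", "SUFFIX"),
    ],
)
```

A corpus of the kind I am after would supply this information for every word of every sentence, so that predicted segment boundaries and tags can be compared against the gold annotation.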
After a quick search, I could only find the following resource (which I myself have been involved in):
The Quranic Arabic Corpus (http://corpus.quran.com/documentation/morphologicalfeatures.jsp) – contains 77,430 words of Quranic Arabic, each divided into morphological segments and tagged with multiple features. Freely available.
Surely there must be more for other languages – or are all such resources closed or non-free?
It would be great to have a few links to such downloadable resources that could be used to train and evaluate statistical morphological segmentation and tagging algorithms. Suggestions for any language would be much appreciated.
Kais Dukes (sckd at leeds.ac.uk)
Institute for Artificial Intelligence
University of Leeds
United Kingdom
http://www.kaisdukes.com