[Corpora-List] Language complexity for textual processing

Taras Zagibalov taras8055 at gmail.com
Thu Jan 26 11:04:30 CET 2012


I wonder if anyone knows of any research on evaluating language complexity with regard to text processing? Intuitively, I can, for example, assume that English is easier for text processing than French because the latter is more inflected, which would require more complex lemmatisation. German is probably more complex than French because of "word-chaining" on top of inflection. Chinese is much easier because of its lack of inflection, but the absence of word delimiters makes the language difficult for traditional "word-based" processing (please note that I mean text processing, thus ignoring the complex tonal phonetics of the language). Russian and many other Slavic languages are difficult due to morphology and free word order; Arabic is difficult due to the variety of regional dialects and its syllabic-consonant writing system. Hebrew should be similar to Arabic, 'minus' the regional diversity.

Has anyone tried to rank or group these languages according to the amount of labour required to produce an NLP system for them? I do not mean the availability of already developed tools, but rather developing 'from scratch'.

Thanks a lot.


PS I am aware of existing language complexity rankings, but they were developed with regard to second language acquisition, which involves phonetics.
