I recently published a paper on developing a Japanese blog corpus. One part of it compares POS distributions across corpora of different sizes (small, medium, large) but similar genre and language, and across corpora of comparable size but different languages (Japanese, British English, Italian). *
Interestingly, the paper was once rejected, one of the reasons being that comparing POS distributions is "meaningless" and a "waste of time". I'm happy that at least some people think it is not.
*) Michal Ptaszynski, Pawel Dybala, Rafal Rzepka, Kenji Araki and Yoshio Momouchi, “YACIS: A Five-Billion-Word Corpus of Japanese Blogs Fully Annotated with Syntactic and Affective Information”, In Proceedings of The AISB/IACAP World Congress 2012 in Honour of Alan Turing, 2nd Symposium on Linguistic and Cognitive Approaches To Dialog Agents (LaCATODA 2012), pp. 40-49
-----------------------------
From: Karin Cavallin <karin.cavallin at ling.gu.se>
To: "corpora at uib.no" <corpora at uib.no>
Date: Wed, 12 Dec 2012 10:00:46 +0000
Subject: [Corpora-List] Difference in POS tag distribution in different genres
Does anyone know of any study of the difference in part-of-speech tag distribution between genres (and an analysis of the reasons for it)? A quick study I made yesterday showed, for example, that my working hypothesis that there are more proper nouns in newspaper text than in fiction was correct, at least on the data I examined.
Karin Cavallin
PhD Student in Computational Linguistics
University of Gothenburg, Sweden
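
A comparison of the kind discussed in this thread could be sketched roughly as follows. This is only a minimal illustration, not the method used in either study: the function name `pos_distribution`, the tag labels, and the two tiny tagged samples are all hypothetical, and a real comparison would use corpora tagged with an actual POS tagger.

```python
from collections import Counter


def pos_distribution(tagged_tokens):
    """Return the relative frequency of each POS tag in a list of (word, tag) pairs."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


# Hypothetical toy samples; the tags are illustrative labels, not real tagger output.
news = [
    ("Reuters", "PROPN"), ("reported", "VERB"), ("that", "SCONJ"),
    ("Sweden", "PROPN"), ("won", "VERB"),
]
fiction = [
    ("She", "PRON"), ("opened", "VERB"), ("the", "DET"),
    ("old", "ADJ"), ("door", "NOUN"),
]

news_dist = pos_distribution(news)
fiction_dist = pos_distribution(fiction)

# Print the two distributions side by side, one tag per line.
for tag in sorted(set(news_dist) | set(fiction_dist)):
    print(f"{tag:6s} news={news_dist.get(tag, 0.0):.2f} "
          f"fiction={fiction_dist.get(tag, 0.0):.2f}")
```

Using relative rather than absolute frequencies keeps the comparison meaningful when the corpora differ in size, which matters for both the cross-genre and the cross-size comparisons mentioned above.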