[Corpora-List] POS statistics of lemmas in world languages

Michal Ptaszynski ptaszynski at ieee.org
Wed Jul 25 19:40:00 CEST 2018

Hi Mikhail,

You can also check my paper cited below. It includes some statistics for Japanese, English and French.

Good luck with your research,

Michal Ptaszynski, Pawel Dybala, Rafal Rzepka, Kenji Araki and Yoshio Momouchi: "YACIS: A Five-Billion-Word Corpus of Japanese Blogs Fully Annotated with Syntactic and Affective Information" In Proceedings of The AISB/IACAP World Congress 2012 in Honour of Alan Turing, 2nd Symposium on Linguistic and Cognitive Approaches To Dialog Agents (LaCATODA 2012), pp. 40-49, 2-6 July 2012, University of Birmingham, Birmingham, UK http://arakilab.media.eng.hokudai.ac.jp/~ptaszynski/data/LaCATODA2012_yacis_paper.pdf

Michal Ptaszynski

> On 24 July 2018, at 19:19, Vladimír Benko <vladimir.benko at juls.savba.sk> wrote:
> Dear Mikhail,
> Statistical data for some of the languages mentioned are available via web interface at our Aranea Portal site:
> http://unesco.uniba.sk/guest/
> All the best,
> Vlado B, 12:20
>> Dear all,
>> I am wondering if anybody could provide data on, or point me to, POS statistics of lemmas in world languages, i.e. how many unique verbal/nominal/etc. lemmas are found in a corpus of a known size. Obviously, the figures depend heavily on corpus size, genre, etc., so I am looking for data based on more or less balanced corpora in the 100+ million range. I am especially interested in the following languages:
>> Japanese, Korean, Chinese
>> Hungarian, Finnish
>> Hindi, German, Dutch, English, Polish, Greek, Romanian, Spanish, French
>> Malay
>> Arabic
>> Basque
>> Thank you in advance for any pointers!
>> Best, Mikhail
>> --
>> Mikhail Kopotev, PhD habil.
>> Associate Professor
>> Dept. of Modern Languages
>> University of Helsinki
>> http://www.helsinki.fi/~kopotev
> --
> Vladimír Benko
> Slovak Academy of Sciences
> Ľ. Štúr Institute of Linguistics
> Panská 26, SK-81101 Bratislava
> Tel +421-2-54431762 Fax -54431756
> http://aranea.juls.savba.sk/guest/
> https://www.facebook.com/araneawebcorpora/