[Corpora-List] A new dataset release - WS353 translated to multiple languages and re-scored by fluent speakers of these languages

Roi Reichart roiri at ie.technion.ac.il
Wed Aug 12 19:32:44 CEST 2015


Greetings,

We would like to announce the release of a new resource - multilingual WS353. This resource consists of translations of the WS353 word association dataset into three languages: German, Italian and Russian. Each of the translated datasets is scored by 13 human judges (crowd workers), all fluent speakers of that language. For consistency, we also collected human judgments for the original English dataset according to the same protocol applied to the other languages.

This dataset makes it possible to explore the impact of the "judgment language" (the language in which word pairs are presented to the human judges) on the resulting similarity scores, and to evaluate vector space models in a truly multilingual setup (i.e., when both the training and the test data are multilingual).
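As an illustration of how a vector space model could be evaluated against human scores of this kind, here is a minimal sketch in Python. It compares the model's cosine similarities with the human scores using Spearman rank correlation, a common choice for word similarity benchmarks; it is not necessarily the exact procedure used in the paper. The function names, the (word1, word2, score) layout and the toy vectors and scores are all assumptions made up for this example and are not taken from the released files.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranks(values):
    """1-based ranks of the values, with ties given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate(pairs, vectors):
    """Correlate model similarities with human scores.

    pairs:   list of (word1, word2, human_score) tuples (assumed layout)
    vectors: dict mapping a word to its vector (list of floats)
    Pairs with a word missing from `vectors` are skipped.
    """
    model_scores, human_scores = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(score)
    return spearman(model_scores, human_scores)


# Toy data, invented for illustration only (not from the dataset).
toy_vectors = {
    "tiger": [1.0, 0.0],
    "cat": [0.9, 0.1],
    "car": [0.0, 1.0],
    "train": [0.3, 0.7],
}
toy_pairs = [
    ("tiger", "cat", 7.0),
    ("car", "train", 6.0),
    ("tiger", "car", 1.0),
]
toy_correlation = evaluate(toy_pairs, toy_vectors)
```

Running the same function once per language, with vectors trained on that language and the scores collected in that language, would give one correlation per judgment language to compare.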

The translation and annotation process, as well as related research on the impact of judgment language, are described in the paper:

Judgment Language Matters: Multilingual Vector Space Models for Judgment Language Aware Lexical Semantics. 2015. Ira Leviant, Roi Reichart. Preprint published on arXiv. arXiv:1508.00106

The data and paper can be downloaded from the project page at:

http://technion.ac.il/~irakr/MultilingualVSMdata.html

We will soon release similar data for the SimLex-999 word similarity dataset.

Please do not hesitate to contact Ira or me with any questions you may have regarding this data.

Best, Roi Reichart


