[Corpora-List] A new dataset release - WS353 translated to multiple languages and re-scored by fluent speakers of these languages

Roi Reichart roiri at ie.technion.ac.il
Mon Aug 31 01:11:33 CEST 2015

A quick update:

Our page containing the multilingual datasets with human judgment for word pair relatedness has moved to:


In the next few days we will upload scored translations of the simLex999 dataset to the page. As with wordSim353, the translations are into German, Italian and Russian, and the similarity scores are collected from fluent speakers of the target languages. We will post an announcement when we release the data.

Best, Roi Reichart

On Wed, Aug 12, 2015 at 8:32 PM, Roi Reichart <roiri at ie.technion.ac.il> wrote:

> Greetings,
> We would like to announce the release of a new resource - multilingual
> WS353. This resource consists of translations of the WS353 word
> association data set into three languages: German, Italian and Russian.
> Each of the translated datasets is scored by 13 human judges (crowd
> workers) - all fluent speakers of its language. For consistency, we
> also collected human judgments for the original English corpus
> according to the same protocol applied to the other languages.
> This dataset makes it possible to explore the impact of the "judgment
> language" (the language in which word pairs are presented to the human
> judges) on the resulting similarity scores, and to evaluate vector
> space models in a truly multilingual setup (i.e. when both the
> training and the test data are multilingual).
> The translation and annotation process, as well as related research on
> the impact of judgment language, are described in the paper:
> Judgment Language Matters: Multilingual Vector Space Models for
> Judgment Language Aware Lexical Semantics. 2015. Ira Leviant, Roi
> Reichart. Preprint published on arXiv. arxiv:1508.00106
> The data and paper can be downloaded from the project page at:
> http://technion.ac.il/~irakr/MultilingualVSMdata.html
> We will soon release similar data for the simLex999 word similarity
> dataset.
> Please do not hesitate to contact Ira or myself with any questions you
> may have regarding this data.
> Best,
> Roi Reichart