[Corpora-List] A Shared Task on Word Sense Induction and Disambiguation for the Russian Language: Training Datasets are now Available

Alexander Panchenko panchenkoalexander at gmail.com
Sat Nov 11 22:09:43 CET 2017

We invite you to participate in the shared task on Word Sense Induction and Disambiguation for the Russian Language, sponsored by ACL SIGSLAV <http://sigslav.cs.helsinki.fi/>: <http://russe.nlpub.org/2018/wsi>.

*TLDR of the task*: You are given a word, e.g., "bank", and a set of text fragments (aka "contexts") in which this word occurs, e.g., "bank is a financial institution that accepts deposits" and "river bank is a slope beside a body of water". You need to group these contexts into clusters that correspond to the various senses of the word; the number of clusters is not known in advance. In this example, you want two groups: one with the contexts of the company sense and one with the contexts of the area sense of the word "bank".

The training dataset and detailed instructions for participants are available at our GitHub repository <https://nlpub.github.io/russe-wsi-kit/>. If you are interested in participating, please register using this form <https://goo.gl/forms/fnTNOwk4PrsZySX82>.


Word Sense Induction (WSI) is the process of automatically identifying the senses of a word. While various sense induction and disambiguation approaches have been evaluated in the past for Western European languages, e.g., English, French, and German, no systematic evaluation of WSI for Slavic languages <http://sigslav.cs.helsinki.fi/> is available at the moment. This shared task takes a first step towards bridging this gap by setting up an evaluation for one Slavic language. Its goal is to compare sense induction and disambiguation systems for the Russian language. Many Slavic languages still lack broad-coverage lexical resources of the kind available for English, such as WordNet, which provide a comprehensive inventory of senses. Therefore, the word sense induction methods investigated in this shared task can be of great value for enabling semantic processing of Slavic languages.

*Task Description*

The shared task is structurally similar to prior WSI tasks for the English language, such as the SemEval 2007 WSI <http://semeval2.fbk.eu/semeval2.php?location=tasks&taskid=2> and SemEval 2010 WSI&D <https://www.cs.york.ac.uk/semeval2010_WSI/> tasks. We use the "lexical sample" setting. Namely, we provide the participants with a set of contexts representing examples of ambiguous words, like the word "bank" in "In geography, the word *bank* generally refers to the land alongside a body of water." For each context, a participant needs to disambiguate one target word. Note that we do not provide any sense inventory: the participant can assign sense identifiers of their choice to a context, e.g., "bank#1" or "bank (area)".
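
As a rough illustration of the lexical sample setting, here is a toy Python sketch that groups contexts of one target word by shared content words and assigns sense identifiers of its own choosing, such as "bank#1". It is not the official baseline and not a competitive method: the function names, the stopword list, the overlap threshold, and the last two example contexts are all assumptions made up for this illustration.

```python
# Toy sketch only: greedy clustering of contexts by shared content words.
# The stopword list and threshold are illustrative assumptions, not task settings.
STOPWORDS = {"a", "an", "the", "is", "was", "of", "in", "to", "that",
             "and", "on", "by", "with", "beside"}


def content_words(context, target):
    """Lowercased tokens of a context, minus stopwords and the target word."""
    tokens = {t.strip(".,;:!?\"'").lower() for t in context.split()}
    return {t for t in tokens if t and t != target and t not in STOPWORDS}


def induce_senses(target, contexts, min_shared=1):
    """Label each context with a sense identifier such as 'bank#1'.

    A context joins the first existing cluster it shares at least
    `min_shared` content words with; otherwise it opens a new cluster,
    so the number of senses is not fixed in advance.
    """
    clusters = []  # one vocabulary (set of content words) per induced sense
    labels = []
    for context in contexts:
        words = content_words(context, target)
        for idx, vocab in enumerate(clusters):
            if len(words & vocab) >= min_shared:
                vocab |= words  # grow the cluster vocabulary in place
                labels.append(f"{target}#{idx + 1}")
                break
        else:
            # No existing cluster was similar enough: start a new sense.
            clusters.append(set(words))
            labels.append(f"{target}#{len(clusters)}")
    return labels


contexts = [
    "bank is a financial institution that accepts deposits",
    "river bank is a slope beside a body of water",
    "the bank accepts cash deposits",                # hypothetical context
    "the bank of the river was covered with water",  # hypothetical context
]
labels = induce_senses("bank", contexts)
for label, context in zip(labels, contexts):
    print(label, "|", context)
```

The identifiers themselves carry no meaning; only the grouping of contexts that they induce matters.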


We provide three training datasets, based on various sense inventories and corpora, which can be used for developing the models. Once the test datasets are released, the participants will need to use the developed models to disambiguate the test sentences and submit their final results. Training and testing datasets use the same corpora and annotation approaches, but the target words will be different for training and testing datasets.

*Quality Measure*

Similarly to SemEval 2010 Task 14 WSI&D, we use a gold standard in which each ambiguous target word is provided with a set of instances, i.e., contexts containing the target word. Each instance is manually annotated with a single sense identifier according to a predefined sense inventory. Each participating system assigns sense labels to these ambiguous word occurrences, which can be viewed as a clustering of the instances according to their sense labels. To evaluate a system, its labeling of the contexts is compared to the gold standard labeling. We use the Adjusted Rand Index (ARI) as the quantitative measure of clustering quality.
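
To make the measure concrete, below is a minimal pure-Python sketch of the Adjusted Rand Index in its standard pair-counting form. It is not the official evaluation script of the task; the function name and the toy labels are assumptions chosen for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(gold, predicted):
    """Adjusted Rand Index between two labelings of the same instances."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    total_pairs = comb(len(gold), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two instances: treated as trivially identical

    # Pairs of instances placed together in both, in gold, and in predicted.
    together_both = sum(comb(c, 2) for c in Counter(zip(gold, predicted)).values())
    together_gold = sum(comb(c, 2) for c in Counter(gold).values())
    together_pred = sum(comb(c, 2) for c in Counter(predicted).values())

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = together_gold * together_pred / total_pairs
    maximum = (together_gold + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are one single cluster
    return (together_both - expected) / (maximum - expected)


# Hypothetical gold senses and two hypothetical system outputs for "bank".
gold = ["company", "area", "company", "area"]
system_a = ["bank#1", "bank#2", "bank#1", "bank#2"]  # same grouping, other names
system_b = ["bank#1", "bank#1", "bank#1", "bank#1"]  # everything in one cluster

print(adjusted_rand_index(gold, system_a))
print(adjusted_rand_index(gold, system_b))
```

Because ARI compares groupings rather than label names, a system is free to choose its own sense identifiers: the first labeling reproduces the gold clustering and scores 1, while putting all contexts into one cluster scores 0, the level expected by chance.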

*Baseline Systems*

We provide a state-of-the-art baseline that demonstrates the task and the data formats. For the knowledge-free track, we particularly encourage participation of systems based on unsupervised word sense embeddings, e.g., AdaGram. For the knowledge-rich track, word sense embeddings tied to a sense inventory, e.g., AutoExtend, can be obtained on the basis of lexical resources such as RuThes <http://www.labinform.ru/pub/ruthes/index.htm> and RuWordNet <http://ruwordnet.ru/ru/>.

*Important Dates*

- *First Call for Participation*: October 15, 2017.

- *Release of the Training Data*: November 1, 2017.

- *Release of the Test Data*: December 15, 2017.

- *Submission of the Results*: January 15, 2018.

- *Results of the Shared Task*: February 1, 2018.