[Corpora-List] Call for Participation - Vardial Evaluation Campaign 2020

Zampieri, Marcos M.Zampieri at wlv.ac.uk
Mon Mar 23 05:26:22 CET 2020

Call for Participation - VarDial Evaluation Campaign 2020

Within the scope of the seventh VarDial workshop, co-located with COLING 2020, we are organizing an evaluation campaign on similar languages, varieties and dialects with three shared tasks.

URL: https://sites.google.com/view/vardial2020/evaluation-campaign

To participate and to receive the training data, please fill in the registration form available on the workshop website.

The tasks we are organizing this year are the following (please check the website for more information):

- RDI - Romanian Dialect Identification: In this shared task we provide participants with the MOROCO data set for training, which contains Moldavian (MD) and Romanian (RO) text samples collected from the news domain. The task is a binary classification by dialect: a model is required to discriminate between the Moldavian (MD) and the Romanian (RO) dialects. The task is closed, so participants are not allowed to use external data to train their models. The test set will contain newly collected text samples not previously included in MOROCO. The test samples will come from a different domain, so methods have to take the cross-domain nature of the task into account.

- SMG - Social Media Variety Geolocation: Most existing VarDial tasks are language identification tasks: they are framed as classification tasks in which each instance is associated with a language variety label. For many language areas, defining a set of discrete labels is not trivial, as there is a continuum between varieties rather than clear-cut borders. Therefore, we introduce a geolocation task this year: given a text, participants have to predict its geographic location in terms of latitude/longitude coordinates. Geolocation can be framed as a double regression task (one predicted value for latitude and one for longitude), but more sophisticated model architectures have been proposed (e.g., Rahimi et al. 2017a, 2017b). Using data from the social media platforms Twitter and Jodel, we provide three subtasks for three language areas: 1. Standard German Jodels; 2. Swiss German Jodels; 3. BCMS Tweets. All three subtasks will use the same data format and evaluation methodology, and participants are encouraged to submit their systems for all subtasks.

- ULI - Uralic Language Identification: This task focuses on discriminating between the languages in the Uralic group as defined by the ISO 639-3 standard. The task includes 29 relevant languages, some of which are extremely closely related and similar, such as Kven Finnish (fkv) and Tornedalen Finnish (fit). These languages are used from Scandinavia, Estonia, and Finland all the way to Russian Siberia. Many of the languages used within Russia are written using modified Cyrillic alphabets. Most of the included languages can be considered under-resourced, for example Karelian (krl) and Livvi-Karelian (olo), which have fewer than 40,000 native speakers combined. Even more challenging examples are Nganasan, with an estimated 125 speakers and a very limited online presence, and Kemi Sami, which is extinct and only scarcely documented. We acknowledge that the ISO 639-3 classification we have used may not be without problems, but for the purposes of this shared task it identifies these 29 language varieties adequately. Three tracks are available in this shared task. More information and data are available on the website.
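
For readers new to the "double regression" framing mentioned under SMG, here is a minimal sketch of what it could look like. It is not a campaign baseline and not the official data format: the character-bigram features, the function names (featurize, train, predict) and the four toy texts with their coordinates are all assumptions made up for illustration. It fits two independent linear regressors, one for latitude and one for longitude, with a normalized stochastic gradient update, using only the Python standard library.

```python
from collections import Counter


def featurize(text):
    """Bag of character bigrams (an assumed, deliberately simple feature set)."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))


def train(samples, epochs=200, lr=0.5):
    """Fit two independent linear regressors: index 0 = latitude, 1 = longitude.

    samples: list of (text, (lat, lon)) pairs.
    Returns (mean, weights), where mean holds the average target per output
    and weights holds one sparse weight dict per output.
    """
    n = len(samples)
    mean = [sum(target[k] for _, target in samples) / n for k in (0, 1)]
    weights = [{}, {}]
    data = [(featurize(text), target) for text, target in samples]
    for _ in range(epochs):
        for x, y in data:
            # Squared norm of the feature vector, used to normalize the step.
            norm = sum(v * v for v in x.values()) or 1.0
            for k in (0, 1):
                w = weights[k]
                pred = mean[k] + sum(w.get(f, 0.0) * v for f, v in x.items())
                err = pred - y[k]
                # Normalized least-mean-squares update for this output.
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) - lr * err * v / norm
    return mean, weights


def predict(model, text):
    """Return a (latitude, longitude) estimate for a text."""
    mean, weights = model
    x = featurize(text)
    return tuple(
        mean[k] + sum(weights[k].get(f, 0.0) * v for f, v in x.items())
        for k in (0, 1)
    )


# Invented toy data: the texts and coordinates are illustrative only.
toy_samples = [
    ("grüezi mitenand", (47.4, 8.5)),
    ("moin moin", (53.6, 10.0)),
    ("servus beinand", (48.1, 11.6)),
    ("dobar dan", (45.8, 16.0)),
]
model = train(toy_samples)
lat, lon = predict(model, "moin moin")
```

A text with no known bigrams falls back to the mean training coordinates, which is one simple way to handle unseen input. A real system would need held-out evaluation and richer features; please check the website for the actual data format and evaluation details.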

Best, Marcos
