[Corpora-List] Call for Participation - Vardial Evaluation Campaign 2021

Zampieri, Marcos M.Zampieri at wlv.ac.uk
Fri Dec 18 05:17:20 CET 2020

Call for Participation - VarDial Evaluation Campaign 2021

Within the scope of the eighth VarDial workshop, co-located with EACL 2021, we are organizing an evaluation campaign on similar languages, varieties and dialects with four shared tasks.

URL: https://sites.google.com/view/vardial2021/evaluation-campaign

To participate and receive the training data, please fill in the registration form available on the workshop website. The training sets will be released on Monday, December 21, 2020.

The tasks we are organizing this year are the following (please check the website for more information):

- DLI - Dravidian Language Identification: The Dravidian languages are a language family spoken mainly in the south of India. The four major literary Dravidian languages are Tamil (ISO 639-3: tam), Telugu (ISO 639-3: tel), Malayalam (ISO 639-3: mal), and Kannada (ISO 639-3: kan). Tamil, Malayalam, and Kannada are closely related, all belonging to the South Dravidian subgroup. The DLI shared task provides participants with a collection of 16,672 YouTube comments as a training set. The comments contain code-mixed sentences in English and one of the South Dravidian languages (Tamil, Malayalam, or Kannada). All comments are written in Roman script (a non-native script). The task is to identify the language of each comment.

- RDI - Romanian Dialect Identification: In this second iteration of the Romanian Dialect Identification (RDI) shared task, we provide participants with an augmented version of the MOROCO data set for training, which contains Moldavian (MD) and Romanian (RO) text samples collected from the news domain. A new test set has been collected, which will allow participants to improve on the results they obtained in VarDial 2020. The task is a binary classification by dialect, in which a classification model is required to discriminate between the Moldavian (MD) and the Romanian (RO) dialects. The task is closed; participants are therefore not allowed to use external data to train their models. The test set will contain newly collected text samples, not previously included in MOROCO. The test samples will come from a different domain, hence the methods have to take the cross-domain nature of the task into account. RDI participants may use other external resources in their systems, e.g. unlabelled corpora, lexicons, pre-trained embeddings, etc.

- SMG - Social Media Variety Geolocation: In this second iteration of the SMG task, we again focus on a geolocation (rather than identification) task: given a text, the participants have to predict its geographic location in terms of latitude/longitude coordinates. Using data from the social media platforms Twitter and Jodel, we provide extended datasets for the same three subtasks as in 2020: 1. Standard German Jodels; 2. Swiss German Jodels; 3. BCMS Tweets. All three subtasks will use the same data format and evaluation methodology, and participants are encouraged to submit their systems for all subtasks.

- ULI - Uralic Language Identification: This task focuses on discriminating between the languages in the Uralic group as defined by the ISO 639-3 standard. This is an open public leaderboard competition continuing from VarDial 2020, in which participants can submit at any point until the final submission date. The task includes 29 individual relevant languages, some of which are extremely closely related and similar, such as Kven Finnish (fkv) and Tornedalen Finnish (fit). These languages are spoken from Scandinavia, Estonia, and Finland all the way to Russian Siberia.

Best, Marcos
