[Corpora-List] Second Call for Participation - Third VarDial Evaluation Campaign

Zampieri, Marcos M.Zampieri at wlv.ac.uk
Mon Jan 28 13:26:11 CET 2019

Call for Participation - Third VarDial Evaluation Campaign

Within the scope of the sixth VarDial workshop, co-located with NAACL 2019, we are organizing an evaluation campaign on similar languages, varieties and dialects with five shared tasks.

URL: https://sites.google.com/view/vardial2019/campaign

The tasks we are organizing this year are the following (please check the website for more information):

- German Dialect Identification (GDI): After two successful editions of the (Swiss) German Dialect Identification task, we propose to organize a third iteration of this task. We will again focus on four Swiss German dialect areas (Basel, Bern, Lucerne, Zurich). We provide updated speech transcripts for all dialect areas and also release corresponding acoustic data in the form of iVectors as well as (predicted) word-level normalisation. In particular, the acoustic data may help to overcome transcriber bias; the recent iterations of the ADI task have already shown that acoustic features substantially improve dialect identification.

- Cross-lingual Morphological Analysis (CMA): We introduce the task of cross-lingual morphological analysis. Given a word in an unknown related language, for example "navifraghju" ("shipwreck" in Corsican), a human speaker of several related languages is able to deduce that it is a noun in the singular by making deductions from similar words, for example: "naufragi" (Catalan), "naufragio" (Spanish, Italian), "naufrágio" (Portuguese) and "naufrage" (French). In this task we invite participants to create computational models which will be able to do the same. There will be two language families represented, Romance (fusional morphology) and Turkic (agglutinative morphology). In the "Closed" track, participants will be given a set of word forms with all valid morphological analyses in six languages and asked to predict the valid morphological analyses for a seventh, unseen language. In the "Semi-Closed" track, the process will be the same, only participants will be provided with additional raw data by the organisers. This will take the form of raw text Wikipedia dumps, bilingual dictionaries from the Apertium project and any treebanks available in the known languages from the Universal Dependencies project.

- Discriminating between Mainland and Taiwan variation of Mandarin Chinese (DMT): Like English, Mandarin has several varieties across its speech communities. This task aims at discriminating between two major varieties of Mandarin Chinese: Putonghua (Mainland China) and Guoyu (Taiwan). We provide a corpus of approximately 10,000 news-domain sentences for each of the two varieties. The main task will be to determine whether a sentence comes from a news article from Mainland China or from Taiwan. The sentences are tokenized and punctuation is removed from the texts. Both the traditional and the simplified versions of the same corpus are available.

- Moldavian vs. Romanian Cross-topic Identification (MRC): In the Moldavian vs. Romanian Cross-topic Identification shared task we provide participants with the MOROCO data set, which contains Moldavian and Romanian samples of text collected from the news domain. The samples belong to one of the following six topics: culture, finance, politics, science, sports, tech. The samples are preprocessed in order to eliminate named entities. For each sample, the data set provides corresponding dialectal and category labels. Based on these labels, we propose three sub-tasks for the 2019 VarDial Evaluation Campaign. The first sub-task is a binary classification by dialect task, in which a classification model is required to discriminate between the Moldavian and the Romanian dialects. The second sub-task is a Moldavian to Romanian cross-dialect multi-class classification by topic task, in which a model is required to classify the samples written in the Romanian dialect into six topics, using samples written in the Moldavian dialect for training. Finally, the third sub-task is a Romanian to Moldavian cross-dialect multi-class classification by topic task, in which a model is required to classify the samples written in the Moldavian dialect into six topics, using samples written in the Romanian dialect for training.

- Cuneiform Language Identification (CLI): This task focuses on discriminating between languages and dialects originally written using cuneiform signs. The task includes two different languages: Sumerian and Akkadian. Furthermore, the Akkadian language is divided into six dialects: Old Babylonian, Middle Babylonian peripheral, Standard Babylonian, Neo Babylonian, Late Babylonian, and Neo Assyrian. These languages and dialects were used in ancient Mesopotamia and span a time period of 3,000 years. For training and development, we provide the participants with varying amounts of text encoded in Unicode cuneiform signs for each language or dialect. We are interested in seeing whether the task of language identification between dialects using the same logosyllabic writing system is different from language identification between languages using segmental scripts.
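
As an illustration of how the three MRC sub-tasks described above relate to one another, here is a minimal Python sketch. It is not the organisers' code: the record layout (`Sample`), the dialect codes "MD" and "RO", the function name `build_subtasks` and the placeholder texts are all assumptions made for the example, and the actual MOROCO distribution format may differ.

```python
from collections import namedtuple

# Hypothetical record layout; the real MOROCO file format may differ.
Sample = namedtuple("Sample", ["text", "dialect", "topic"])


def build_subtasks(samples):
    """Derive (text, label) pairs for the three MRC sub-tasks."""
    md = [s for s in samples if s.dialect == "MD"]  # Moldavian samples
    ro = [s for s in samples if s.dialect == "RO"]  # Romanian samples
    return {
        # Sub-task 1: binary classification by dialect, over all samples.
        "dialect": [(s.text, s.dialect) for s in samples],
        # Sub-task 2: train on Moldavian topics, classify Romanian samples.
        "md_to_ro": {
            "train": [(s.text, s.topic) for s in md],
            "test": [(s.text, s.topic) for s in ro],
        },
        # Sub-task 3: train on Romanian topics, classify Moldavian samples.
        "ro_to_md": {
            "train": [(s.text, s.topic) for s in ro],
            "test": [(s.text, s.topic) for s in md],
        },
    }


# Placeholder samples only; the texts are not taken from the data set.
toy = [
    Sample("placeholder text 1", "MD", "sports"),
    Sample("placeholder text 2", "MD", "politics"),
    Sample("placeholder text 3", "RO", "sports"),
]
tasks = build_subtasks(toy)
```

In the two cross-dialect sub-tasks the training and test sets never share a dialect, so a topic classifier cannot rely on dialect-specific cues seen during training.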

To participate and to receive the training data, please fill in the registration form available on the workshop website. The training sets will be released on February 5, 2019.

Best, Marcos

-----
Dr. Marcos Zampieri
Research Group in Computational Linguistics
University of Wolverhampton, UK
http://pers-www.wlv.ac.uk/~u22984/
