[Corpora-List] First call for participation: MT4All Unsupervised MT Shared Task at SIGUL 2022

Ksenia Kharitonova ksenia.kharitonova at gmail.com
Mon Mar 14 12:55:15 CET 2022


MT4All Unsupervised MT Shared Task

at SIGUL 2022

(24-25 June, Marseille)

FIRST CALL FOR PARTICIPATION

We invite you to participate in the first edition of the MT4All Unsupervised Machine Translation Shared Task, hosted by the ELRA/ISCA Special Interest Group on Under-Resourced Languages Workshop (SIGUL 2022). Papers on the task will be published as part of the Proceedings.

Invitation to Participate – Expression of Interest: <https://docs.google.com/forms/d/1tllq0jWhcKwMHgPtRCA4aLkgLDuN8JlZG7Vp4TqcNQ0>

TASK DESCRIPTION

For this Shared Task we will leverage the resources generated by the recently finished CEF project MT4All, with the aim of exploring unsupervised MT techniques based only on monolingual corpora. In the course of the project, the following novel datasets were created: 18 monolingual corpora for specific languages and domains, 12 bilingual dictionaries and translation models, and 10 annotated datasets for evaluation. Most of them will be used in the present Shared Task.

The task is divided into three separate subtasks, each one covering a specific domain and set of languages.

- Subtask 1: Unsupervised translation from English to Ukrainian, Georgian, and Kazakh in the Legal domain.
- Subtask 2: Unsupervised translation from English to Finnish, Latvian, and Norwegian Bokmål in the Financial domain.
- Subtask 3: Unsupervised translation from English to German, Norwegian Bokmål, and Spanish in the Customer support domain.

In this Shared task, we are interested in how the in-domain monolingual data that we will provide can be leveraged by creating a purely unsupervised machine translation model, either by

- training an unsupervised model from scratch, or
- adding value to an existing pre-trained model, on the condition that
    - it has been trained on monolingual datasets
    - it has not been fine-tuned with any parallel data
    - it is publicly accessible from the HuggingFace repository

Although we exclude the possibility of fine-tuning the models with any existing parallel data, we allow making use of the bilingual resources created in the framework of MT4All using purely unsupervised technologies.

As additional monolingual data, only the monolingual Oscar datasets may be used.

IMPORTANT DATES

- Training data release: 10.03.2022
- Test sets release: 25.04.2022
- Results deadline: 02.05.2022
- Paper submission deadline: 16.05.2022
- Acceptance notice: 30.05.2022
- Camera ready: 13.06.2022
- Workshop starts: 24.06.2022

Please visit the website for more details: https://sigul-2022.ilc.cnr.it/mt4all-shared-task/

If you have any comments and/or questions, do not hesitate to contact ksenia.kharitonova at bsc.es.


