No machine translation is available for most of the ~7000 languages spoken on Earth, because very limited or no parallel corpora exist for them. Research on unsupervised and very low resource machine translation is important for alleviating this problem. Unsupervised machine translation requires only monolingual data, while very low resource supervised machine translation uses very limited parallel data.
At WMT 2018 and WMT 2019, the first and second shared tasks on Unsupervised Machine Translation (UMT) were held as part of the news translation track. In 2018, the language pairs were Turkish-English, Estonian-English and German-English. In 2019, we also tested "simulated" unsupervised systems for German to Czech translation (where no German/Czech parallel data was allowed).
We now propose a third edition of the UMT task, which aims at a more realistic scenario: German to Upper Sorbian (and Upper Sorbian to German) translation. Upper Sorbian is a minority language of Germany in the Slavic language family (related to, e.g., Lower Sorbian, Czech and Polish), and we provide here most of the digital data that is available, as far as we know.
As we were very recently able to obtain a very small amount of parallel data for this language pair, we also offer a very low resource supervised translation task.
The tasks are:
- Unsupervised Machine Translation: German to Upper Sorbian and Upper Sorbian to German.
- Very Low Resource Supervised Machine Translation: German to Upper Sorbian and Upper Sorbian to German.
For further information and train/test data, please see:
Thanks and kind regards,
Alexander Fraser
CIS, LMU Munich