We announce and call for participation in the WMT 2020 shared task on assessing the quality of sentence pairs in a parallel corpus.
- In the WMT18 shared task on parallel corpus filtering
<http://www.statmt.org/wmt18/parallel-corpus-filtering.html>, we posed
the challenge of a noisy web-crawled parallel corpus for German-English and
asked participants to score each sentence pair. These quality scores were
used to select subsets of the corpus consisting of the highest-scoring
sentence pairs, to train statistical and neural machine translation systems
on them, and to evaluate those systems on held-out test sets.
- In the WMT19 shared task on parallel corpus filtering for low-resource
conditions, we followed the same protocol, but this time for Nepali-English
and Sinhala-English. For low-resource language pairs like these, both the
existing clean parallel corpora and the to-be-scored noisy web-crawled data
come in smaller amounts and are of lower quality.
This year, we pose two different language pairs, Khmer-English and Pashto-English. In addition to the task of computing quality scores for the purpose of filtering, we also allow for the re-alignment of sentence pairs from document pairs.

DETAILS

We provide a very noisy 58.3 million-word (English token count) Khmer-English corpus and an 11.6 million-word Pashto-English corpus. These corpora were partly crawled from the web as part of the Paracrawl <http://paracrawl.eu/> project, and partly extracted from the CommonCrawl <https://commoncrawl.org/> data set. We ask participants to provide scores for each sentence pair in each of the noisy parallel sets. The scores will be used to subsample sentence pairs that amount to 5 million English words. The quality of the resulting subsets is determined by the quality of a neural machine translation system (fairseq) trained on this data. The quality of the machine translation system is measured by BLEU score (sacrebleu) on a held-out test set of Wikipedia translations <https://github.com/facebookresearch/flores> for Khmer-English and Pashto-English.
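As an illustration of the evaluation procedure, score-based subsampling under a token budget can be sketched as follows. This is not the official task tooling; the function and parameter names are illustrative, and the whitespace token count is a simplifying assumption.

```python
# Sketch of score-based subsampling: keep the highest-scoring sentence
# pairs until the English side reaches a token budget (5 million words
# in the shared task). Names here are illustrative, not official.

def subsample(pairs, scores, budget=5_000_000):
    """pairs: list of (src, eng) tuples; scores: parallel list of floats.
    Returns top-scoring pairs whose English sides total at most `budget`
    whitespace-separated tokens."""
    ranked = sorted(zip(scores, pairs), key=lambda x: x[0], reverse=True)
    selected, total = [], 0
    for score, (src, eng) in ranked:
        n_tokens = len(eng.split())  # simple whitespace token count
        if total + n_tokens > budget:
            continue  # this pair would exceed the budget; skip it
        selected.append((src, eng))
        total += n_tokens
    return selected
```

In the actual evaluation the organizers apply the subsampling, so participants only submit scores; this sketch just shows how the 5-million-word subsets are conceptually derived from them.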
We also provide clean parallel and monolingual training data for the two language pairs. This existing data comes from a variety of sources and is of mixed quality and relevance.
Note that the task addresses the challenge of *data quality* and *not domain-relatedness* of the data for a particular use case. While we provide a development and development test set that are also drawn from Wikipedia articles, these may be very different from the final official test set in terms of topics.
The provided raw parallel corpora are the outcome of a processing pipeline that aimed for high recall at the cost of precision, so they are very noisy. They exhibit noise of all kinds (wrong language on the source or target side, sentence pairs that are not translations of each other, bad language, incomplete or bad translations, etc.).
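For the noise types just listed, simple heuristic pre-filters are a common starting point before learned scoring. The sketch below is illustrative only (the function name and threshold are our own choices, not part of the task), and real submissions typically combine such signals with language identification and model-based scores.

```python
# Illustrative heuristic filters for common noise types in web-crawled
# parallel data: empty sides, implausible length ratios, and untranslated
# copies. The max_ratio threshold of 3.0 is an arbitrary example value.

def basic_filters(src, tgt, max_ratio=3.0):
    """Return False for obviously noisy sentence pairs, True otherwise."""
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if src_len == 0 or tgt_len == 0:
        return False  # one side is empty
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    if ratio > max_ratio:
        return False  # extreme length mismatch suggests misalignment
    if src.strip() == tgt.strip():
        return False  # identical sides are usually untranslated copies
    return True
```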
This year, we also provide the document pairs from which the sentence pairs were extracted (using Hunalign <http://mokk.bme.hu/en/resources/hunalign/>
and LASER <https://github.com/facebookresearch/LASER>). You may align sentences yourself from these document pairs, thus producing your own set of sentence pairs. If you opt to do this, you have to submit all aligned sentence pairs and their quality scores.

IMPORTANT DATES

Release of raw parallel data: March 28, 2020
Submission deadline for subsampled sets: July 1, 2020
System descriptions due: July 15, 2020
Announcement of results: June 29, 2020
Paper notification: August 17, 2020
Camera-ready for system descriptions:

ORGANIZERS

Philipp Koehn, Johns Hopkins University
Francisco (Paco) Guzmán, Facebook
Vishrav Chaudhary, Facebook
Ahmed Kishky, Facebook
Naman Goyal, Facebook
Peng-Jen Chen, Facebook