This new shared task tackles the problem of cleaning noisy parallel corpora. Given a noisy parallel corpus (crawled from the web), participants develop methods to filter it down to a smaller, high-quality set of sentence pairs.
*DETAILS* We provide a very noisy 1-billion-word (English token count) German-English corpus crawled from the web as part of the ParaCrawl project. We ask participants to subselect sentence pairs that amount to (a) 100 million words and (b) 10 million words. The quality of the resulting subsets is determined by the quality of a statistical machine translation system (Moses, phrase-based) and a neural machine translation system (Marian) trained on this data. The quality of the machine translation systems is measured by BLEU score on (a) the official WMT 2018 news translation test set and (b) another undisclosed test set.
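As a rough illustration of the subselection step, the sketch below picks the highest-scoring sentence pairs until a target English token count is reached. The tab-separated score file, its column order, and the whitespace tokenization are assumptions for illustration only, not part of the official task data format.

    """Minimal sketch: subselect top-scoring sentence pairs until a target
    English token count is reached. File name and tab-separated score
    format are hypothetical, not the official task format."""

    def subselect(scored_pairs, target_tokens):
        """scored_pairs: iterable of (score, english, german) tuples.
        Returns the highest-scoring pairs whose English sides sum to
        roughly `target_tokens` whitespace-separated tokens."""
        selected, total = [], 0
        # Sort by score, best first, and take pairs until the budget is spent.
        for score, en, de in sorted(scored_pairs, key=lambda p: p[0], reverse=True):
            if total >= target_tokens:
                break
            selected.append((en, de))
            total += len(en.split())
        return selected

    if __name__ == "__main__":
        # Hypothetical input: one pair per line as "score<TAB>English<TAB>German".
        pairs = []
        with open("scored.tsv", encoding="utf-8") as f:
            for line in f:
                score, en, de = line.rstrip("\n").split("\t")
                pairs.append((float(score), en, de))
        for en, de in subselect(pairs, target_tokens=100_000_000):
            print(en, de, sep="\t")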
*IMPORTANT DATES*
Release of raw parallel data: April 1, 2018
Submission deadline for subsampled sets: June 22, 2018
Announcement of results: July 9, 2018
System descriptions due: July 27, 2018
Camera-ready for system descriptions: August 31, 2018
*ORGANIZERS*
Philipp Koehn (Johns Hopkins University / University of Edinburgh)
Huda Khayrallah (Johns Hopkins University)
Kenneth Heafield (University of Edinburgh)
Mikel Forcada (University of Alicante)
*ACKNOWLEDGEMENTS* This shared task is partially supported by a Google Faculty Research Award and the Connecting Europe Facility via the ParaCrawl project.