[Corpora-List] DiscoMT 2015 Shared Task on Pronoun Translation (at EMNLP 2015)

Jörg Tiedemann Jorg.Tiedemann at lingfil.uu.se
Fri Feb 27 13:39:20 CET 2015

===============================================
DiscoMT 2015 Shared Task on Pronoun Translation
===============================================

Website: https://www.idiap.ch/workshop/DiscoMT/shared-task
In connection with EMNLP 2015 (http://emnlp2015.emnlp.org)

We are happy to announce a new exciting task for people interested in (discourse-aware) machine translation, anaphora resolution and machine learning in general. The EMNLP 2015 Workshop on Discourse in Machine Translation features two shared tasks:

Task 1: Pronoun-Focused Machine Translation
Task 2: Cross-Lingual Pronoun Prediction

Task 1 requires machine translation (from English to French) and focuses on the evaluation of translated pronouns. We provide training data and a baseline SMT model to get started.

Task 2 is a straightforward classification task in which one has to predict the correct translation of a given pronoun in English (it or they) into French (ce, elle, elles, il, ils, ça, cela, on, OTHER). We provide training and development data and a simple baseline system using an N-gram language model.

More details of the two tasks are attached below and can be found at our website: https://www.idiap.ch/workshop/DiscoMT/shared-task

Important Dates:

4 May, 2015     Release of the MT test set (task 1)
10 May, 2015    Submission of translations (task 1)
11 May, 2015    Release of the classification test set (task 2)
18 May, 2015    Submissions of classification results (task 2)
28 May, 2015    System paper submission deadline
Sep., 2015      Workshop in Lisbon

Mailing list: https://groups.google.com/d/forum/discomt2015

Downloads: https://www.dropbox.com/sh/c8qnpag5z29jyh6/AAAAqk1TE9-UvcgEnfccdRwxa?dl=0
Download alternative 1: http://opus.lingfil.uu.se/DiscoMT2015/
Download alternative 2: http://stp.lingfil.uu.se/~joerg/DiscoMT2015/

-------------------------------------------------------------------------
Acknowledgements: Funding for the manual evaluation of the pronoun-focused
translation task is generously provided by the European Association for
Machine Translation (EAMT)
-------------------------------------------------------------------------

==========================
Detailed Task Description:
==========================

* Overview

The DiscoMT 2015 shared task will consist of two subtasks, relevant to both the MT and discourse communities: pronoun-focused translation, a practical MT task, and cross-lingual pronoun prediction, a classification task that requires no specific MT expertise and is interesting as a machine learning task in its own right. For groups wishing to participate in both tasks, one possibility is to convert a system for the classification task into an MT feature model using existing software such as the Docent decoder (Hardmeier et al., ACL 2013). Both tasks use the English–French language pair, for which baseline MT performance is high enough to produce basically intelligible output, and whose two languages differ in interesting ways in their pronoun systems.

* Task 1: Pronoun-Focused Translation Task

In the pronoun-focused translation task, you are given a collection of English input documents, which you are asked to translate into French. This task is the same as for other MT shared tasks such as that of WMT. The difference is in the way the translations are evaluated. Instead of checking the overall translation quality, we specifically look at how the English subject pronouns it and they were translated. The principal evaluation will be carried out manually and will focus specifically on the correctness of pronoun translation. Thanks to a grant from the EAMT, the manual evaluation will be run by the organisers and participants don't have to contribute evaluations. Automatic reference-based metrics are available for development purposes.

The texts in the test corpus will consist of transcripts of TED talks. The training data contains an in-domain corpus of TED talks as well as some additional data from Europarl and news texts. To make the participating systems as comparable as possible, we ask you to constrain the training data of your system to the resources listed below as far as you can, but this is not a strict requirement and we do accept submissions using additional resources. If your system uses any resources other than those of the official data release, please be specific about what was included in the system description paper. For the same reason, we also suggest that you use the tokeniser provided by us unless you have a good reason to do otherwise.

The test set will be supplied in the XML source format of the 2009 NIST MT evaluation, which is described on the last page of this document. See the development set included in the data release for an example. Your translation should be submitted in the XML translation format of the 2009 NIST MT evaluation. We also need you to submit, in a separate file, word alignments linking occurrences of the pronouns it and they (case-insensitive) to the corresponding words generated by your MT system. The format of the word alignments should be the same as that of the alignments included in the cross-lingual pronoun prediction data (see below). Word alignments can be obtained, for instance, by running the Moses SMT decoder with the -print-alignment-info option or by parsing the segment-level comments added to the output by the Docent decoder. You may submit alignments for the complete sentence if it's easier for you, but only links for it and they will be used. If your MT system cannot output word alignments, please contact the shared task organisers to discuss how to proceed. We'll try to find a solution. More details on how to submit will be added to this page later.
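As a rough illustration of the alignment filtering described above, the following Python sketch keeps only the links whose source word is it or they (case-insensitive). It assumes whitespace-tokenised source text and the zero-based SRC-TGT link format used in the pronoun prediction data; the function name `pronoun_links` is hypothetical, and the layout of the final submission file is not specified here.

```python
# Sketch only: filter a sentence-level word alignment down to the links
# for "it" and "they". Assumes zero-based "SRC-TGT" links separated by
# spaces and a whitespace-tokenised English source segment.
PRONOUNS = {"it", "they"}


def pronoun_links(source_tokens, alignment):
    """Return the alignment links whose source word is 'it' or 'they'."""
    kept = []
    for link in alignment.split():
        src, _tgt = link.split("-")
        if source_tokens[int(src)].lower() in PRONOUNS:
            kept.append(link)
    return " ".join(kept)


if __name__ == "__main__":
    source = "They arrive first .".split()
    print(pronoun_links(source, "0-0 1-1 2-3 3-4"))
```

Submitting the full sentence alignment is also accepted, so a filter like this is optional.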

The test set will be released on 4 May 2015, and your translations are due on 10 May 2015. Note that we will ensure that each document in the test set contains an adequate number of challenging pronouns, so the corpus-level distribution of the pronouns in the test set may differ from that of the training corpus. However, each document will be a complete TED talk with a naturally occurring ensemble of pronouns.

* Task 2: Cross-Lingual Pronoun Prediction

In the cross-lingual pronoun prediction task, you are given an English document with a human-generated French translation and a set of word alignments between the two languages. In the French translation, the words aligned to the English third-person subject pronouns it and they are substituted by placeholders. Your task is to predict, for each placeholder, the word that should go there from a small, closed set of classes, using any information you can extract from the documents. The following classes exist:

ce      the French pronoun ce (sometimes with elided vowel as c'), as in the expression c'est 'it is'
elle    feminine singular subject pronoun
elles   feminine plural subject pronoun
il      masculine singular subject pronoun
ils     masculine plural subject pronoun
ça      demonstrative pronoun (including the misspelling ca and the rare elided form ç')
cela    demonstrative pronoun
on      indefinite pronoun
OTHER   some other word, or nothing at all, should be inserted

This task will be evaluated automatically by matching the predictions against the words found in the reference translation by computing the overall accuracy and precision, recall and F-score for each class. The primary score for the evaluation is the macro-averaged F-score over all classes. Compared to accuracy, the macro-averaged F-score favours systems that consistently perform well on all classes and penalises systems that maximise the performance on frequent classes while sacrificing infrequent ones.
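The scoring scheme can be sketched in Python as follows. This is not the official scorer: the function name `evaluate` is hypothetical, and treating a class with no predictions or no gold instances as scoring 0 is an assumption that the official evaluation may handle differently.

```python
# Sketch only: accuracy, per-class precision/recall/F1 and macro-averaged F1.
from collections import Counter

CLASSES = ["ce", "cela", "elle", "elles", "il", "ils", "on", "ça", "OTHER"]


def evaluate(gold, predicted, classes=CLASSES):
    """Return (accuracy, {class: (P, R, F1)}, macro-averaged F1)."""
    correct, n_pred, n_gold = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        n_gold[g] += 1
        n_pred[p] += 1
        if g == p:
            correct[g] += 1

    per_class = {}
    for c in classes:
        # Assumption: undefined precision or recall counts as 0.
        precision = correct[c] / n_pred[c] if n_pred[c] else 0.0
        recall = correct[c] / n_gold[c] if n_gold[c] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[c] = (precision, recall, f1)

    accuracy = sum(correct.values()) / len(gold) if gold else 0.0
    macro_f1 = sum(f for _, _, f in per_class.values()) / len(classes)
    return accuracy, per_class, macro_f1


if __name__ == "__main__":
    # A system that always predicts the frequent class "ce".
    gold = ["ce", "ce", "ce", "il"]
    pred = ["ce", "ce", "ce", "ce"]
    acc, _, macro = evaluate(gold, pred, classes=["ce", "il"])
    print(f"accuracy = {acc:.2f}, macro F1 = {macro:.4f}")
```

In the toy example, always predicting ce reaches an accuracy of 0.75 but a macro-averaged F1 of only about 0.43, because the il class contributes an F1 of 0.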

The data supplied for the classification task consists of parallel English-French text with word alignments. In the French text, a subset of the words aligned to English occurrences of it and they have been replaced by placeholders of the form REPLACE_xx, where xx is the index of the English word the placeholder is aligned to. Your task is to predict one of the classes listed above for each occurrence of a placeholder.

The training and development data is supplied in a file format with five tab-separated columns:

1. the class label
2. the word actually removed from the text (may be different from the class label for class OTHER and in some edge cases)
3. the English source segment
4. the French target segment with pronoun placeholders
5. the word alignment (a space-separated list of alignments of the form SRC-TGT, where SRC and TGT are zero-based word indices in the source and target segment, respectively)

A single segment may contain more than one placeholder. In that case, columns 1 and 2 contain multiple space-separated entries in the order of placeholder occurrence. A document segmentation of the data is provided in separate files for each corpus. These files contain one line per segment, but the precise format varies depending on the type of document markup available for the different corpora. In the development and test data, the files have a single column containing the ID of the document the segment is part of.

Here's an example line from one of the training data files:

elles Elles They arrive first . REPLACE_0 arrivent en premier . 0-0 1-1 2-3 3-4
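A line in this format could be read along the following lines. This is a minimal sketch, assuming the five columns are separated by tab characters and that each placeholder is a whitespace-delimited token of its own; the names `Segment`, `parse_line` and `placeholders` are hypothetical and not part of the data release.

```python
# Sketch only: parse one line of the five-column classification data.
import re
from typing import List, NamedTuple, Tuple

PLACEHOLDER = re.compile(r"REPLACE_(\d+)")


class Segment(NamedTuple):
    labels: List[str]                 # column 1: class labels
    removed: List[str]                # column 2: words actually removed
    source: List[str]                 # column 3: English tokens
    target: List[str]                 # column 4: French tokens with placeholders
    alignment: List[Tuple[int, ...]]  # column 5: (SRC, TGT) index pairs


def parse_line(line):
    """Split a tab-separated data line into its five columns."""
    labels, removed, source, target, alignment = line.rstrip("\n").split("\t")[:5]
    links = [tuple(int(i) for i in link.split("-")) for link in alignment.split()]
    return Segment(labels.split(), removed.split(), source.split(),
                   target.split(), links)


def placeholders(segment):
    """Return (target index, source index, English word) per placeholder."""
    found = []
    for tgt_index, token in enumerate(segment.target):
        match = PLACEHOLDER.fullmatch(token)
        if match:
            src_index = int(match.group(1))
            found.append((tgt_index, src_index, segment.source[src_index]))
    return found


if __name__ == "__main__":
    example = ("elles\tElles\tThey arrive first .\t"
               "REPLACE_0 arrivent en premier .\t0-0 1-1 2-3 3-4")
    seg = parse_line(example)
    print(seg.labels, placeholders(seg))
```

For the example line, the single placeholder sits at French position 0 and points back to the English word They at index 0.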

The test set will be supplied in the same format, but with columns 1 and 2 (elles and Elles) empty, so each line starts with two tab characters. Your submission should have the same format as column 1 above, so a correct solution would contain the class label elles in this case. Each line should contain as many space-separated class labels as there are REPLACE tags in the corresponding segment. For each segment not containing any REPLACE tags, an empty line should be emitted. Additional tab-separated columns may be present in the submission, but will be ignored. Note in particular that you are not required to predict the second column. The submitted files should be encoded in UTF-8 (like the data we provide).
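The submission format can be sketched as follows, again assuming tab-separated columns and whitespace-delimited placeholder tokens. The functions `submission_line`, `write_submission` and the trivial predictor `always_other` are hypothetical stand-ins for a real classifier and its I/O code.

```python
# Sketch only: produce one submission line per test segment.
import re

PLACEHOLDER = re.compile(r"REPLACE_\d+")


def always_other(source, target, position):
    """Placeholder predictor (hypothetical): always answers OTHER."""
    return "OTHER"


def submission_line(test_line, predict=always_other):
    """Return space-separated labels, one per REPLACE tag (empty if none)."""
    columns = test_line.rstrip("\n").split("\t")
    source, target = columns[2].split(), columns[3].split()
    labels = [predict(source, target, i)
              for i, token in enumerate(target)
              if PLACEHOLDER.fullmatch(token)]
    return " ".join(labels)


def write_submission(test_path, out_path, predict=always_other):
    """Write one output line per input segment, encoded in UTF-8."""
    with open(test_path, encoding="utf-8") as test_file, \
            open(out_path, "w", encoding="utf-8") as out_file:
        for line in test_file:
            out_file.write(submission_line(line, predict) + "\n")


if __name__ == "__main__":
    test_line = ("\t\tThey arrive first .\t"
                 "REPLACE_0 arrivent en premier .\t0-0 1-1 2-3 3-4")
    print(submission_line(test_line))
```

Segments without any REPLACE tag yield an empty string, so the output keeps exactly one line per input segment.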

The test set will be the same as for the pronoun-focused translation task. The complete test data for the classification task, including reference translations and word alignments, will be released on 11 May 2015, after the completion of the translation task. Your submission is due on 18 May 2015. Details on how to submit will be added to our website later.

Note: If you create a classifier for this task, but haven't got an MT system of your own, you might consider using your classifier as a feature function in the document-level SMT decoder Docent to create a submission for the pronoun translation task.

* Discussion Group

If you are interested in participating in the shared task, we recommend that you sign up to our discussion group to make sure you don't miss any important information. Feel free to ask any questions you may have about the shared task!


* Training Data and Tools

All training and development data for both subtasks can be downloaded from the following location:

https://www.dropbox.com/sh/c8qnpag5z29jyh6/AAAAqk1TE9-UvcgEnfccdRwxa?dl=0
Download alternative 1: http://opus.lingfil.uu.se/DiscoMT2015/
Download alternative 2: http://stp.lingfil.uu.se/~joerg/DiscoMT2015/

The Dropbox folder contains many files; see the list below. To create a system for the pronoun classification task, you should start with the classification training data. For the pronoun-focused translation task, we provide the original training data, preprocessed data sets including full word alignments, and a complete pre-trained phrase-based SMT system. To minimise preprocessing differences among the submitted systems, we suggest (but do not require) that you start from the most processed version of the data that is usable for the type of system that you plan to build.

Look at the README file for more information about the individual files we provide: http://stp.lingfil.uu.se/~joerg/DiscoMT2015/README

* Classification Baseline

We have a baseline model for the classification task that looks only at language model scores. It uses KenLM, and the language model must be in KenLM's binary format, which is the case for the "corpus.5.trie.kenlm" file included in the "baseline-all" tarball.

Results with default options on TEDdev (same data as tst2010):

ce    : P = 110/129 = 85.27%   R = 110/148 = 74.32%   F1 = 79.42%
cela  : P =   4/ 15 = 26.67%   R =   4/ 10 = 40.00%   F1 = 32.00%
elle  : P =   6/ 13 = 46.15%   R =   6/ 30 = 20.00%   F1 = 27.91%
elles : P =   4/ 12 = 33.33%   R =   4/ 16 = 25.00%   F1 = 28.57%
il    : P =  35/137 = 25.55%   R =  35/ 55 = 63.64%   F1 = 36.46%
ils   : P =  86/ 94 = 91.49%   R =  86/139 = 61.87%   F1 = 73.82%
on    : P =   3/ 10 = 30.00%   R =   3/ 10 = 30.00%   F1 = 30.00%
ça    : P =  16/ 22 = 72.73%   R =  16/ 61 = 26.23%   F1 = 38.55%
OTHER : P = 225/315 = 71.43%   R = 225/278 = 80.94%   F1 = 75.89%

or a macro-averaged fine-grained F1 of 46.96%
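The macro-averaged figure can be recomputed from the counts reported for the default options. The sketch below uses only those counts; the variable and function names are illustrative.

```python
# Sketch only: recompute the macro-averaged F1 from the reported counts
# for the default options on TEDdev.
# Each entry is (correct, predicted, gold) for one class.
counts = {
    "ce": (110, 129, 148),
    "cela": (4, 15, 10),
    "elle": (6, 13, 30),
    "elles": (4, 12, 16),
    "il": (35, 137, 55),
    "ils": (86, 94, 139),
    "on": (3, 10, 10),
    "ça": (16, 22, 61),
    "OTHER": (225, 315, 278),
}


def f1(correct, predicted, gold):
    """F1 as the harmonic mean of P = correct/predicted and R = correct/gold."""
    total = predicted + gold
    return 2 * correct / total if total else 0.0


macro_f1 = sum(f1(*c) for c in counts.values()) / len(counts)
accuracy = (sum(c[0] for c in counts.values())
            / sum(c[2] for c in counts.values()))

if __name__ == "__main__":
    print(f"macro-averaged F1 = {100 * macro_f1:.2f}%")
    print(f"accuracy          = {100 * accuracy:.2f}%")
```

The 747 gold instances yield an accuracy of about 65.5%, well above the macro-averaged F1, which reflects the weak scores on infrequent classes such as elle and on.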

Results with "--null-penalty -2.0":

ce    : P = 121/145 = 83.45%   R = 121/148 = 81.76%   F1 = 82.59%
cela  : P =   4/ 21 = 19.05%   R =   4/ 10 = 40.00%   F1 = 25.81%
elle  : P =   7/ 15 = 46.67%   R =   7/ 30 = 23.33%   F1 = 31.11%
elles : P =   5/ 14 = 35.71%   R =   5/ 16 = 31.25%   F1 = 33.33%
il    : P =  36/143 = 25.17%   R =  36/ 55 = 65.45%   F1 = 36.36%
ils   : P =  99/109 = 90.83%   R =  99/139 = 71.22%   F1 = 79.84%
on    : P =   3/ 13 = 23.08%   R =   3/ 10 = 30.00%   F1 = 26.09%
ça    : P =  19/ 32 = 59.38%   R =  19/ 61 = 31.15%   F1 = 40.86%
OTHER : P = 211/255 = 82.75%   R = 211/278 = 75.90%   F1 = 79.17%

or a macro-averaged fine-grained F1 of 48.35%
