[Corpora-List] 2nd CfP: BEA2019 GEC Shared Task – Training data released!

Ekaterina Kochmar ek358 at cam.ac.uk
Mon Jan 28 16:18:41 CET 2019

Building Educational Applications 2019 Shared Task: Grammatical Error Correction

NEW! 25/01/2019: Training data released!



Building Educational Applications 2019 Shared Task: Grammatical Error Correction
Florence, Italy
August 2, 2019


================================================================================
Call for Participation
================================================================================

Grammatical error correction (GEC) is the task of automatically correcting grammatical errors in text; e.g. [I follows his advices -> I followed his advice]. It can be used not only to help language learners improve their writing skills, but also to alert native speakers to accidental mistakes or typos.
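To make the notion of a correction concrete, the example above can be broken down into token-level edits. The following is a minimal sketch using Python's standard difflib module; the function name token_edits and the (start, end, replacement) tuple layout are illustrative assumptions, not part of the shared task tooling.

```python
from difflib import SequenceMatcher


def token_edits(source, corrected):
    """Return (start, end, replacement) edits over whitespace tokens.

    start/end are token offsets into the source sentence; replacement is
    the corrected text for that span (empty string for a deletion).
    This is a hypothetical helper for illustration only.
    """
    src = source.split()
    cor = corrected.split()
    matcher = SequenceMatcher(a=src, b=cor, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.append((i1, i2, " ".join(cor[j1:j2])))
    return edits


if __name__ == "__main__":
    for edit in token_edits("I follows his advices", "I followed his advice"):
        print(edit)
```

A plain sequence alignment like this is only an approximation: it does not classify error types or merge linguistically related tokens into a single edit.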

GEC gained significant attention in the Helping Our Own (HOO) and CoNLL shared tasks between 2011 and 2014, but has since become more difficult to evaluate given a lack of standardised experimental settings. In particular, recent systems have been trained, tuned and tested on different combinations of corpora using different metrics. One of the aims of this shared task is hence to once again provide a platform where different approaches can be trained and tested under the same conditions.

Another significant problem facing the field is that system performance is still primarily benchmarked against the CoNLL-2014 test set, even though this 5-year-old dataset only contains 50 essays on 2 different topics written by 25 South-East Asian undergraduates in Singapore. This means that systems have increasingly overfit to a very specific genre of English and so do not generalise well to other domains. As a result, this shared task introduces the Cambridge English Write & Improve (W&I) corpus, a new error-annotated dataset that represents a much more diverse cross-section of English language levels and domains. Write & Improve is an online web platform that assists non-native English students with their writing (https://writeandimprove.com/).

Participating teams will be provided with training and development data from the W&I corpus to build their systems. Depending on the chosen track, supplementary data may also be used. System output will be evaluated on a blind test set using ERRANT (https://github.com/chrisjbryant/errant).
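As a rough illustration of edit-based evaluation, system edits can be compared with reference edits and summarised as precision, recall and an F-score. The sketch below is a simplified assumption and not ERRANT's actual implementation: the function name f_beta, the edit tuple format, and the default beta of 0.5 (which weights precision more heavily than recall) are all choices made for this example and are not specified in this call.

```python
def f_beta(hyp_edits, ref_edits, beta=0.5):
    """Compute (precision, recall, F_beta) over two collections of edits.

    Edits are hashable tuples, e.g. (start, end, replacement). An edit
    counts as correct only if it matches a reference edit exactly.
    Hypothetical helper; beta=0.5 is an assumption for illustration.
    """
    hyp, ref = set(hyp_edits), set(ref_edits)
    tp = len(hyp & ref)   # edits proposed and also in the reference
    fp = len(hyp - ref)   # edits proposed but not in the reference
    fn = len(ref - hyp)   # reference edits the system missed
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    denom = beta ** 2 * precision + recall
    f = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, f


if __name__ == "__main__":
    reference = [(1, 2, "followed"), (3, 4, "advice")]
    hypothesis = [(1, 2, "followed"), (3, 4, "advises")]
    print(f_beta(hypothesis, reference))
```

Official scores should always be computed with the ERRANT scripts themselves, since the real scorer handles details (such as multiple annotators) that this sketch ignores.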

In addition to learner data, we will provide an annotated development and test set extracted from the LOCNESS corpus, a collection of essays written by native English students compiled by the Centre for English Corpus Linguistics at the University of Louvain.

Tracks
------
There are 3 tracks in the BEA 2019 shared task. Each track controls the amount of annotated data that can be used in a system. We place no restrictions on the amount of unannotated data that can be used (e.g. for language modelling).

* Restricted

In the restricted setting, participants may only use the following annotated datasets: FCE, Lang-8 Corpus of Learner English, NUCLE, W&I and LOCNESS.

Note that we restrict participants to the preprocessed Lang-8 Corpus of Learner English rather than the raw, multilingual Lang-8 Learner Corpus because participants would otherwise need to filter the raw corpus themselves.

* Unrestricted

In the unrestricted setting, participants may use any and all datasets, including those in the restricted setting.

* Unsupervised (or minimally supervised)

In the unsupervised setting, participants may not use any annotated training data. Since current state-of-the-art systems rely on as much training data as possible to reach the best performance, the goal of the unsupervised track is to encourage research into systems that do not rely on annotated training data. This track should be of particular interest to researchers working with low-resource languages. However, since we also expect this to be a challenging track, we will allow participants to use the W&I+LOCNESS development set to develop their systems.

Participation
-------------
To participate in the BEA 2019 Shared Task, teams must submit their system output between March 25 and March 29, 2019; the submission window closes at 23:59 GMT on March 29. There is no explicit registration procedure. Further details about the submission process will be provided soon.

Important Dates
---------------
Friday, Jan 25, 2019: New training data released
Monday, March 25, 2019: New test data released
Friday, March 29, 2019: System output submission deadline
Friday, April 12, 2019: System results announced
Friday, May 3, 2019: System paper submission deadline
Friday, May 17, 2019: Review deadline
Friday, May 24, 2019: Notification of acceptance
Friday, June 7, 2019: Camera-ready submission deadline
Friday, August 2, 2019: BEA-2019 Workshop (Florence, Italy)

Organisers
----------
Christopher Bryant, University of Cambridge
Mariano Felice, University of Cambridge
Øistein Andersen, University of Cambridge
Ted Briscoe, University of Cambridge

Contact
-------
Questions and queries about the shared task can be sent to bea2019st at gmail.com.

Further details can be found at https://www.cl.cam.ac.uk/research/nl/bea2019st/
