[Corpora-List] WebNLG+ Data now available!

Claire Gardent claire.gardent at loria.fr
Fri Apr 17 16:38:21 CEST 2020

==========================================================
WebNLG+: The Second WebNLG Challenge
First call for participation: Training data now available!
==========================================================

WebNLG goes bi-lingual (English, Russian) and bi-directional (generation and parsing)!

TASKS
The challenge comprises two main tasks:
- Task 1, RDF-to-text generation: similar to WebNLG 2017 but with new data and into two languages;
- Task 2, Text-to-RDF semantic parsing: converting a text into the corresponding set of RDF triples.

For Task 1, given the four RDF triples shown in (a), the aim is to generate a text such as (b) or (c). For Task 2, the direction is reversed: the aim is to produce the triples in (a) from a text such as (b) or (c).

(a) Set of RDF triples
<entry category="Company" eid="Id21" size="4">
  <modifiedtripleset>
    <mtriple>Trane | foundingDate | 1913-01-01</mtriple>
    <mtriple>Trane | location | Ireland</mtriple>
    <mtriple>Trane | foundationPlace | La_Crosse,_Wisconsin</mtriple>
    <mtriple>Trane | numberOfEmployees | 29000</mtriple>
  </modifiedtripleset>
</entry>

(b) English text
Trane, which was founded on January 1st 1913 in La Crosse, Wisconsin, is based in Ireland. It has 29,000 employees.

(c) Russian text
Компания "Тране", основанная 1 января 1913 года в Ла-Кроссе в штате Висконсин, находится в Ирландии. В компании работают 29 тысяч человек.
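As a rough illustration of the input format, the entry in (a) can be read into (subject, property, object) tuples with the Python standard library. This is only a sketch: the function name parse_entry and the returned dictionary layout are hypothetical, and it assumes that every mtriple uses " | " as the separator, as in the example above.

```python
import xml.etree.ElementTree as ET

# The example entry from (a), copied verbatim.
ENTRY_XML = """
<entry category="Company" eid="Id21" size="4">
  <modifiedtripleset>
    <mtriple>Trane | foundingDate | 1913-01-01</mtriple>
    <mtriple>Trane | location | Ireland</mtriple>
    <mtriple>Trane | foundationPlace | La_Crosse,_Wisconsin</mtriple>
    <mtriple>Trane | numberOfEmployees | 29000</mtriple>
  </modifiedtripleset>
</entry>
"""


def parse_entry(xml_text):
    """Hypothetical helper: turn one <entry> element into a plain dict."""
    entry = ET.fromstring(xml_text.strip())
    triples = []
    for mtriple in entry.iter("mtriple"):
        # Assumption: the three fields are separated by " | ".
        subject, prop, obj = (part.strip() for part in mtriple.text.split(" | ", 2))
        triples.append((subject, prop, obj))
    return {
        "category": entry.get("category"),
        "eid": entry.get("eid"),
        "size": int(entry.get("size")),
        "triples": triples,
    }


parsed = parse_entry(ENTRY_XML)
for triple in parsed["triples"]:
    print(triple)
```

For Task 1 these tuples would be the system input; for Task 2 they would be the target output.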

INDICATIVE DATES
- 15 April 2020: Release of Training and Development Data
- 30 April 2020: Release of some simple preliminary evaluation scripts to support development
- 30 May 2020: Release of the final evaluation scripts
- 13 September 2020: Release of Test Data
- 27 September 2020: Entry submission deadline
- 15-18 December 2020: Results of automatic and human evaluations and system presentations at INLG 2020

DATA DOWNLOAD and REGISTRATION
To register for the WebNLG+ task and download the WebNLG+ training and development data, please fill in the form below:
https://framaforms.org/webnlg-challenge-2020-1586343023

The data, evaluation scripts and system outputs of WebNLG 2017 can also be downloaded here: https://webnlg-challenge.loria.fr/challenge_2017/

EVALUATION
For the evaluation phase, starting on July 17th, new test sets will be released for all categories seen in the training data (see above), and for several new unseen categories (categories not included in the training data). For each task, a team can submit more than one system, but only one output per system; in other words, multiple submissions of the same non-deterministic system should be avoided. Participants are free to choose which tasks and languages they want to provide results for (generation and/or semantic parsing, English and/or Russian).

System outputs as well as baseline and human-produced outputs will be evaluated.

For RDF-to-text generation, two evaluations will be carried out:
- Automatic evaluation, with standard n-gram-based and embedding-based metrics such as BLEU, METEOR, TER, ChrF++, BERTScore, etc.; global and detailed results will be provided (per DBpedia category, per input size, per category and input size, etc.).
- Human evaluation: system outputs will be assessed according to criteria such as grammaticality/correctness, appropriateness/adequacy and fluency/naturalness, by native speakers recruited on crowdsourcing platforms.

For Text-to-RDF semantic parsing, the automatic evaluation of three aspects is foreseen, in terms of recall, precision and F1-score:
- Property identification;
- Subject and object identification;
- Full triple identification.
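To make the three aspects concrete, the sketch below computes precision, recall and F1 for each of them on the Trane example. It assumes exact string matching over sets of items, which is a simplification chosen for illustration; the official evaluation scripts may match triples differently. The function names prf and score_triples, and the predicted triples, are hypothetical.

```python
def prf(gold_items, predicted_items):
    """Precision, recall and F1 over two collections, using exact set matching."""
    gold, predicted = set(gold_items), set(predicted_items)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def score_triples(gold, predicted):
    """Score the three aspects; exact matching is an assumption of this sketch."""
    return {
        "property": prf((p for _, p, _ in gold), (p for _, p, _ in predicted)),
        "subject_object": prf(
            ((s, o) for s, _, o in gold), ((s, o) for s, _, o in predicted)
        ),
        "full_triple": prf(gold, predicted),
    }


# Gold triples from example (a).
gold = [
    ("Trane", "foundingDate", "1913-01-01"),
    ("Trane", "location", "Ireland"),
    ("Trane", "foundationPlace", "La_Crosse,_Wisconsin"),
    ("Trane", "numberOfEmployees", "29000"),
]

# Hypothetical system output: one triple is missing, one has a wrong property.
predicted = [
    ("Trane", "foundingDate", "1913-01-01"),
    ("Trane", "location", "Ireland"),
    ("Trane", "employees", "29000"),
]

scores = score_triples(gold, predicted)
for aspect, (precision, recall, f1) in scores.items():
    print(f"{aspect}: P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this made-up output the subject and object pairs are all correct (precision 1.0, recall 0.75), while the wrong property lowers both the property and the full triple scores.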

Initially, preliminary evaluation scripts are released and can be used to test the models. The final evaluation scripts and metrics used for WebNLG+ will be provided at a later stage (see Indicative Dates).

MOTIVATION
The WebNLG data was originally created to promote the development of RDF verbalisers able to generate short text and to handle micro-planning (i.e., sentence segmentation and ordering, referring expression generation, aggregation); the data for the first challenge included a total of 15 DBpedia categories. The 2020 challenge aims first of all at increasing the datasets (hence, the coverage of the verbalisers), by covering more categories and an additional language. The other main objective of the 2020 edition is to promote the development of knowledge extraction tools, with a task that mirrors the verbalisation task.

[RDF Verbalisers] The RDF language, in which DBpedia is encoded, is widely used within the Linked Data framework. Many large-scale datasets are encoded in this language (e.g., MusicBrainz, FOAF, LinkedGeoData) and official institutions increasingly publish their data in this format. Being able to generate good quality text from RDF data would open the way to many new applications such as making linked data more accessible to lay users, enriching existing text with information drawn from knowledge bases or describing, comparing and relating entities present in these knowledge bases.

[Multilinguality] By providing a bilingual corpus (English and Russian), we aim to promote the development of tools for languages other than English and to allow for experimentation with pre-training and transfer approaches (do the English verbalisations of RDF triples help to verbalise the triples better in Russian?).

[Knowledge extraction] The new semantic parsing task opens up new lines of research in several directions. Can it be used to bootstrap entity linkers? How does RDF-based semantic parsing relate to other semantic parsing tasks where the output semantic representations are lambda terms or KB queries? Can semantic parsing be used to improve generation in ways similar to the back translation approaches proposed in machine translation?

ORGANISING COMMITTEE
* Thiago Castro Ferreira, Federal University of Minas Gerais, Brazil
* Claire Gardent, CNRS/LORIA, Nancy, France
* Nikolai Ilinykh, University of Gothenburg, Sweden
* Chris van der Lee, Tilburg University, The Netherlands
* Simon Mille, Universitat Pompeu Fabra, Barcelona, Spain
* Diego Moussallem, Paderborn University, Germany
* Anastasia Shimorina, Université de Lorraine/LORIA, Nancy, France

CONTACT
mail: webnlg-challenge at inria.fr
website: https://webnlg-challenge.loria.fr/challenge_2020/
twitter: https://twitter.com/webnlg

REFERENCES
* Creating Training Corpora for NLG Micro-Planners. C. Gardent, A. Shimorina, S. Narayan and L. Perez-Beltrachini. Proceedings of ACL 2017. Vancouver (Canada). https://www.aclweb.org/anthology/P17-1017.pdf
* The WebNLG challenge: Generating text from RDF data. C. Gardent, A. Shimorina, S. Narayan and L. Perez-Beltrachini. Proceedings of INLG, 2017. Santiago de Compostela (Spain). https://www.aclweb.org/anthology/W17-3518.pdf
* Building RDF Content for Data-to-Text Generation. L. Perez-Beltrachini, R. Sayed and C. Gardent. Proceedings of COLING 2016. Osaka (Japan). https://www.aclweb.org/anthology/C16-1141.pdf
* Enriching the WebNLG corpus. T. Castro Ferreira, D. Moussallem, E. Krahmer and S. Wubben. Proceedings of INLG, 2018. Tilburg (The Netherlands). https://www.aclweb.org/anthology/W18-6521.pdf
* Creating a corpus for Russian data-to-text generation using neural machine translation and post-editing. A. Shimorina, E. Khasanova and C. Gardent. Proceedings of BSNLP Workshop, 2019. Florence (Italy). https://www.aclweb.org/anthology/W19-3706.pdf

-- CNRS Equipe SYNALP, LORIA Nancy, France
