[Corpora-List] 1st Call for Participation - CAPITEL-EVAL @ IberLEF 2020

Luis Espinosa Anke luis.espinosa83 at gmail.com
Tue Mar 10 17:11:37 CET 2020




When: September, 22-25

Where: SEPLN 2020 (Málaga, Spain)

Deadline for systems output: May, 17

Webpage: https://sites.google.com/view/capitel2020


Within the framework of the PlanTL <https://www.plantl.gob.es/>, the Royal Spanish Academy (RAE <http://www.rae.es/>) and the Secretariat of State for Digital Advancement (SEAD <https://avancedigital.gob.es/>) of the Ministry of Economy signed an agreement to develop a linguistically annotated corpus of Spanish news articles, aimed at expanding the language resource infrastructure for Spanish. The corpus is named CAPITEL (Corpus del Plan de Impulso a las Tecnologías del Lenguaje) and is composed of contemporary news articles obtained through agreements with a number of news media providers. CAPITEL has three levels of linguistic annotation: morphosyntactic (with lemmas and Universal Dependencies-style POS tags <https://universaldependencies.org/u/pos/index.html> and features <https://universaldependencies.org/u/feat/index.html>), syntactic (following Universal Dependencies v2 <https://universaldependencies.org/u/dep/index.html>), and named entities.

The linguistic annotation of a subset of the CAPITEL corpus has been revised using a machine-annotation-followed-by-human-revision procedure. Manual revision has been carried out by a team of linguistics graduates using the Annotation Guidelines created specifically for CAPITEL. The revised annotations comprise about 1 million words for the named entity layer and roughly 250,000 words for the syntactic layer. Given the size of the corpus and the nature of the annotations, we propose two IberLEF <https://sites.google.com/view/iberlef2020> sub-tasks under the umbrella task CAPITEL @ IberLEF 2020, which will use the revised subset of the CAPITEL corpus in two challenges, namely:

(1) Named Entity Recognition and Classification


(2) Universal Dependency Parsing

Because of the ever-evolving nature of the NLP field and its associated shared task competitions, we deem it relevant to propose new challenges for the Spanish language to determine whether recent developments can push the boundaries of the current state of the art.

Sub-task 1: Named Entity Recognition and Classification in Spanish News Articles

Information extraction tasks, formalized in the late 1980s, are designed to evaluate systems which capture pieces of information present in free text, with the goal of enabling better and faster information and content access. One important class of such information is named entities (NEs) which, roughly speaking, are textual elements corresponding to names of people, places, organizations and others. Three processes can be applied to NEs: recognition (or identification), categorization (assigning a type according to a predefined set of semantic categories), and linking (disambiguating the reference).

The aim of this sub-task is to challenge participants to apply their systems or solutions to the problem of identifying and classifying NEs in Spanish news articles. This two-stage process is referred to as NERC (Named Entity Recognition and Classification).
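As an illustration of the two NERC stages, the following minimal sketch turns per-token tags into typed entity spans. It assumes a BIO tagging scheme and the labels ORG and LOC; neither is specified in this call, so the CAPITEL Annotation Guidelines remain the authoritative reference for the actual scheme and category set. The function name bio_to_spans and the example sentence are hypothetical.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, entity type)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Convert BIO tags into typed spans (assumed scheme, not from the call)."""
    spans: List[Span] = []
    start, etype = None, None
    # A trailing "O" sentinel closes an entity that ends the sentence.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if continues:
            continue
        if etype is not None:
            # Recognition gives the boundaries, classification the type.
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith(("B-", "I-")):
            # A stray I- tag is treated as the start of a new entity.
            start, etype = i, tag[2:]
    return spans


tokens = ["La", "Real", "Academia", "Española", "tiene", "sede", "en", "Madrid"]
tags = ["O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "B-LOC"]
spans = bio_to_spans(tags)
print(spans)  # [(1, 4, 'ORG'), (7, 8, 'LOC')]
```

Here the span boundaries correspond to recognition and the label attached to each span corresponds to classification; linking is outside the scope of the sub-task.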

Sub-task 2: Universal Dependency Parsing of Spanish News Articles

Dependency-based syntactic parsing has become popular in NLP in recent years. One of the reasons for this popularity is the transparent encoding of predicate-argument structures, which is useful in many downstream applications. Another reason is that it is better suited than phrase-structure grammars for languages with free or flexible word order.

Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features and syntactic dependencies) across different human languages. Moreover, the UD initiative is an open community effort with over 200 contributors which has produced more than 100 treebanks in over 70 languages.

The aim of this sub-task is to challenge participants to apply their systems or solutions to the problem of Universal Dependency parsing of Spanish news articles as defined in the Annotation Guidelines for the CAPITEL corpus that will be shared with the participants.
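To make the parsing task concrete, here is a minimal sketch of how a predicted dependency tree could be compared against a gold one. It assumes that data is exchanged in the CoNLL-U format used by Universal Dependencies (HEAD in column 7, DEPREL in column 8) and that unlabeled and labeled attachment scores (UAS and LAS) are of interest; this call does not state the file format or the metric, and the official Evaluation script is the reference. The helpers row, read_arcs and attachment_scores and the toy sentence are hypothetical.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency relation)


def row(i: int, form: str, upos: str, head: int, deprel: str) -> str:
    """Build one 10-column CoNLL-U token line; unused columns are '_'."""
    cols = [str(i), form, "_", upos, "_", "_", str(head), deprel, "_", "_"]
    return "\t".join(cols)


def read_arcs(conllu: str) -> List[Arc]:
    """Read (HEAD, DEPREL) for every syntactic word in a CoNLL-U string."""
    arcs: List[Arc] = []
    for line in conllu.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        arcs.append((int(cols[6]), cols[7]))
    return arcs


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) over tokens aligned one-to-one."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and predicted token sequences must align")
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    las = sum(g == p for g, p in zip(gold, pred)) / n
    return uas, las


# Toy sentence "Madrid es grande": the adjective is the root in UD.
gold_text = "\n".join([
    row(1, "Madrid", "PROPN", 3, "nsubj"),
    row(2, "es", "AUX", 3, "cop"),
    row(3, "grande", "ADJ", 0, "root"),
])
# Hypothetical system output: correct heads, one wrong relation label.
pred_text = "\n".join([
    row(1, "Madrid", "PROPN", 3, "obj"),
    row(2, "es", "AUX", 3, "cop"),
    row(3, "grande", "ADJ", 0, "root"),
])

uas, las = attachment_scores(read_arcs(gold_text), read_arcs(pred_text))
print(f"UAS={uas:.2f} LAS={las:.2f}")  # UAS=1.00 LAS=0.67
```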

Important Dates

March, 15: Sample set, Evaluation script and Annotation Guidelines released.

March, 17: Training set released.

April, 1: Development set released.

April, 29: Test set released (includes background set).

May, 17: Systems output submissions.

May, 21: Results posted and Test set with GS annotations released.

May, 31: Working notes paper submission.

June, 15: Notification of acceptance (peer-reviews).

June, 30: Camera ready paper submission.

September: IberLEF 2020 Workshop.

Organizing Committee

David Pérez Fernández, PlanTL - Ministry of Economy, Spain.

Jordi Porta-Zamorano, Centro de Estudios de la RAE, Spain.

José-Luis Sancho-Sánchez, Centro de Estudios de la RAE, Spain.

Rafael-J. Ureña-Ruiz, Centro de Estudios de la RAE, Spain.

Doaa Samy, Instituto de Ingeniería del Conocimiento (PlanTL-GTO), Spain.

Luis Espinosa-Anke, School of Computer Science and Informatics, Cardiff University, UK.


Jordi Porta-Zamorano (porta at rae.es)

Organizers mailing list

capitel2020org at googlegroups.com

Task-specific mailing lists

capitel2020nerc at googlegroups.com

capitel2020ud at googlegroups.com
