Information is delivered in several forms depending on content and audience: digital newspapers, academic publications, blogs and, among them, educational texts. Understanding a text is not easy for everyone; some readers find the language itself difficult. In the information society, everyone should be able to access all available information easily and simply, so improving access to written language is a topic of growing interest. For many people, the content of a text is a barrier to understanding because it relies on sophisticated and specialized vocabulary. This makes reading difficult, especially for people with special abilities, who may fail to understand the content because of complex or infrequent words, long sentences or passive constructions. Students have different levels of reading comprehension, and the main barrier is the vocabulary in the text: in most cases, understanding the words matters more than grammatical complexity. We propose a Complex Word Identification (CWI) task on educational texts at university level.
Although there are some tasks where Spanish is considered for Complex Word Identification, such as the CWI Shared Tasks at SemEval 2016 and NAACL-HLT 2018, we have created a new annotated corpus of transcriptions of classes taught at the University of Guayaquil (Ecuador): the VYTEDU-CW corpus. We believe this resource can be used to test complex word identification systems configured for an educational setting.
The goal is to mark the words that can be considered complex, that is, difficult for the reader to understand. The corpus used will be the VYTEDU-CW corpus. This task poses some interesting challenges compared to other CWI tasks:
1. Difficult terms have to be judged within the scope of the academic content. That is, many technical terms may need to be disregarded, as they are commonly used in the domain.
2. There are several domains corresponding to different degree programmes, so the system has to adapt to them.
3. No training data will be released, only development data for adjusting systems to the file formats. Therefore, unsupervised or semi-supervised approaches are applicable.
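As an illustration of what an unsupervised approach could look like, here is a minimal Python sketch of a frequency-and-length baseline. It is not the organizers' method or an official baseline: the function names, the thresholds (`max_rel_freq`, `min_length`) and the idea of using in-domain reference texts are assumptions made for this example only. The intuition is that a word is flagged when it is long and rare in some reference material; if that material comes from the same academic domain, common technical terms become frequent and are not flagged.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens (accented letters included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def flag_complex_words(transcript, reference_texts, max_rel_freq=1e-4, min_length=7):
    """Return the sorted list of tokens in `transcript` that are both long and rare.

    Hypothetical baseline: `reference_texts` is any collection of plain-text
    documents used to estimate word frequencies; both thresholds are
    illustrative values, not tuned on any corpus.
    """
    counts = Counter(tok for text in reference_texts for tok in tokenize(text))
    total = sum(counts.values()) or 1  # avoid division by zero on empty references
    flagged = set()
    for tok in tokenize(transcript):
        rel_freq = counts[tok] / total  # Counter returns 0 for unseen tokens
        if len(tok) >= min_length and rel_freq <= max_rel_freq:
            flagged.add(tok)
    return sorted(flagged)
```

Called with a transcription and a list of reference documents, the function returns the candidate complex words in alphabetical order. In practice both thresholds would need adjusting on the development data.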
An ad hoc corpus has been created. The collection contains 55 transcribed videos, with more than 1200 words per transcription on average and 723 words annotated as complex. The details of the corpus will be made available to participants.
- Release of training and development corpora: Feb 1, 2020
- Release of test corpora: May 1, 2020
- Deadline for evaluation: May 12, 2020
- Paper submission: May 25, 2020
- Review notification: June 15, 2020
- Camera ready submission: July 3, 2020
- Publication: September 2020
- Workshop: Málaga (CEDI 2020), September 2020
- Arturo Montejo Ráez, University of Jaén, Spain.
- Jenny Ortiz Zambrano, Doctoral Candidate at the University of Jaén, Spain.
- Miguel Botto-Tobar, Doctoral Candidate at Eindhoven University of
- Maikel Yelandi Leyva Vázquez, Professor of Artificial Intelligence at
Polytechnic University, Ecuador.
- Elsy Rodríguez Revelo, professor at the State University of Guayaquil,
of Mathematical and Physical Sciences.
- Richard Avilés López, Doctoral Candidate at University of Granada,
Arturo Montejo Ráez
Associate Professor (Profesor Titular de Universidad)
amontejo at ujaen.es
Universidad de Jaén, Department of Computer Science
Building A3, office 114, Las Lagunillas s/n, 23071 - Jaén | +34 953 212 882
Researcher ID: http://orcid.org/0000-0002-8643-2714