[Corpora-List] Research internship: 3-6 months, Multiword Expressions: Generalizing over Unseen data (LISN, Orsay, France)

Agata Savary agata.savary at universite-paris-saclay.fr
Fri Oct 22 14:32:02 CEST 2021


How well can deep learning algorithms generalize over unseen data: A case study in multiword expression identification

Master internship proposal, 2021-2022


Domain: natural language processing


Location: Université Paris-Saclay (LISN), Gif-sur-Yvette, France


Research teams: ILES ([...] and Sign Language Processing) of the LISN; TALEP <https://talep.lis-lab.fr/> (Written and Spoken Language Processing) of the LIS




Agata Savary



Carlos Ramisch



Funding: Université Paris-Saclay


Duration: 3-6 months


Remuneration: around 606€/month

Motivation and context

The aim of this internship is to boost applications in Natural Language Processing (NLP) by focusing on one of their major challenges: multiword expressions (MWEs). MWEs are groups of words which exhibit unpredictable properties (Baldwin & Kim, 2010). Most prominently, their meaning does not straightforwardly derive from the meanings of their components. For instance, faire ‘make/do’ and valoir ‘be worth sth’ are verbs, while their combination yields a noun: faire-valoir ‘a stooge, a person who is used by somebody to do things that are unpleasant or dishonest’. Similarly, the meaning of casser sa pipe ‘to die’ (literally ‘to break one’s pipe’) cannot be straightforwardly deduced from the meanings of the individual components. Due to these properties, MWEs are very challenging for applications such as machine translation, information retrieval and opinion mining.

A major task related to MWEs is to automatically identify their occurrences in running text, so as to provide more accurate representations to downstream applications. The PARSEME <https://gitlab.com/parseme/corpora/-/wikis/home> network has been addressing this task via a series of shared tasks on automatic identification of verbal MWEs <https://gitlab.com/parseme/corpora/-/wikis/home#shared-tasks>. Edition 1.1 of the PARSEME shared task (in 2018) showed that identifying MWEs which have not previously been seen in the training corpus is critically hard. Edition 1.2 saw the advent of transformer-based language models (BERT), which brought substantial progress in MWE identification performance. Still, only modest progress was achieved in generalization over unseen data.


The aim of this internship is to better understand the potential of transformer-based models in generalizing over unseen data in MWE identification. More precisely, we wish to:


analyze the results of edition 1.2 of the PARSEME shared task, and in particular those related to unseen data

propose an error analysis methodology for MWEs which are and are not correctly identified, and try to understand the reasons behind this state of affairs

put forward recommendations for future enhancements of the state-of-the-art MWE [...]

(depending on the candidate's profile and the length of the internship) implement a prototype based on these recommendations

Candidate's profile


2nd-year master student in computational linguistics, computer science or a similar field; excellent 1st-year master or 3rd-year bachelor students will also be considered

Interest in linguistics and familiarity with language technology

Good programming skills, preferably in Python

Important dates


Application deadline: 20 November 2021 (or until filled)


Notification: 30 November 2021


Position starts: late January 2022 (at earliest)


Position ends: around late July 2022 (or later)

How to apply

Send your CV and a transcript of your bachelor's and master's grades to Agata Savary <first.last at universite-paris-saclay.fr> and Carlos Ramisch <first.last at lis-lab.fr>.



Baldwin, T. and Kim, S. N. (2010) Multiword expressions, in Nitin Indurkhya and Fred J. Damerau (eds.) Handbook of Natural Language Processing, Second Edition, CRC Press, Boca Raton, USA, pp. 267-292.


Matthieu Constant, Gülşen Eryiğit, Johanna Monti, Lonneke van der Plas, Carlos Ramisch, Michael Rosner, and Amalia Todirascu (2017) Multiword expression processing: A survey, Computational Linguistics, 43(4):837–892.


Carlos Ramisch, Agata Savary, Bruno Guillaume, Jakub Waszczuk, Marie Candito, Ashwini Vaidya, Verginica Barbu Mititelu, Archna Bhatia, Uxoa Iñurrieta, Voula Giouli, Tunga Güngör, Menghan Jiang, Timm Lichte, Chaya Liebeskind, Johanna Monti, Renata Ramisch, Sara Stymne, Abigail Walsh, Hongzhi Xu (2020) Edition 1.2 of the PARSEME Shared Task on Semi-supervised Identification of Verbal Multiword Expressions, in the Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons (MWE-LEX 2020), 13 December 2020, Barcelona, Spain (online).

Agata Savary, Silvio Ricardo Cordeiro, Carlos Ramisch (2019) Without lexicons, multiword expression identification will never fly: A position statement <https://www.aclweb.org/anthology/papers/W/W19/W19-5110/>, in the Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019), 2 August 2019, Florence, Italy.
