[Corpora-List] Research internship : 3-6 months, Multiword Expressions: Quantifying Language Diversity (LISN, Orsay, France)

Agata Savary agata.savary at universite-paris-saclay.fr
Fri Oct 22 14:31:24 CEST 2021


Quantifying diversity of language phenomena in corpora and system predictions: Case study of multiword expressions

Master internship proposal, 2021-2022


Domain : natural language processing (NLP)


Location : Université Paris-Saclay (LISN lab), Gif-sur-Yvette, France; with visits to the University of Tours (LIFAT <https://lifat.univ-tours.fr/> lab) and the University of Orléans (LLL lab)



Research teams : ILES


and Sign Language Processing) of the LISN ;



Bases and Natural Language Processing) of the



Description and Documentation) of the LLL









Emmanuel SCHANG






Funding : Université Paris-Saclay


Duration : 3-6 months


Remuneration : around 606 € / month

Motivation and context

Diversity of naturally occurring phenomena is a vital heritage to be preserved in the current progress- and optimization-driven globalization era. Diversity has been quantified in many domains (ecology, economy, information science, etc.) but less so in natural language processing (NLP). We address this aspect with respect to a particular linguistic phenomenon: multiword expressions (MWEs). MWEs are groups of words which exhibit unpredictable properties (Baldwin & Kim, 2010). Most prominently, their meaning does not straightforwardly derive from the meanings of their components. For instance, the meaning of casser sa pipe 'to die' (literally to break one's pipe) or of sortir du lot 'to be better than others' (literally to quit the batch) cannot be straightforwardly deduced from the meanings of the individual components. Due to these properties, MWEs are very challenging in applications like machine translation, information retrieval, opinion mining, etc.

Language resources dedicated to MWEs include MWE lexicons and MWE-annotated corpora (Savary et al., 2017), while a major computational task is to automatically identify MWEs in running text. The PARSEME <https://gitlab.com/parseme/corpora/-/wikis/home> network has been addressing the MWE identification task via a series of shared tasks on automatic identification of verbal MWEs <https://gitlab.com/parseme/corpora/-/wikis/home#shared-tasks> (Ramisch et al., 2020).
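As a rough illustration of the data side of MWE identification, the sketch below collects the MWEs annotated in one sentence. It assumes the tab-separated .cupt layout of the PARSEME corpora, i.e. the CoNLL-U columns plus an eleventh PARSEME:MWE column holding codes such as 1:VID on the first token of an expression and 1 on its other tokens. The names read_mwes and make_row and the toy sentence are hypothetical.

```python
def read_mwes(sentence_lines):
    """Collect the MWEs annotated in one sentence.

    Returns {mwe_id: {"category": str or None, "lemmas": [str, ...]}}.
    """
    mwes = {}
    for line in sentence_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/metadata lines
        cols = line.rstrip("\n").split("\t")
        lemma, mwe_column = cols[2], cols[10]
        if mwe_column in ("*", "_"):
            continue  # token carries no MWE annotation
        for code in mwe_column.split(";"):  # a token may belong to several MWEs
            mwe_id, _, category = code.partition(":")
            entry = mwes.setdefault(mwe_id, {"category": None, "lemmas": []})
            if category:  # assumption: category appears on the first token only
                entry["category"] = category
            entry["lemmas"].append(lemma)
    return mwes


def make_row(idx, form, lemma, mwe):
    # Toy helper: columns not needed here are filled with "_" (11 in total).
    return "\t".join([str(idx), form, lemma] + ["_"] * 7 + [mwe])


sentence = [
    "# text = Il a cassé sa pipe",
    make_row(1, "Il", "il", "*"),
    make_row(2, "a", "avoir", "*"),
    make_row(3, "cassé", "casser", "1:VID"),
    make_row(4, "sa", "son", "1"),
    make_row(5, "pipe", "pipe", "1"),
]
print(read_mwes(sentence))
```

Grouping by lemma rather than surface form is one possible choice here: it lets inflected occurrences of the same expression count as one type.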

MWEs, like most other phenomena in human language, follow the so-called Zipf's law (Williams et al., 2015): few items are frequent and there is a long tail of rare ones. These few frequent items tend to be less diverse than the numerous items in the "Zipfian tail". Current models, including those for MWE identification, often favour the former and underperform on the latter. Hence, evaluation scores dominated by frequent items overestimate quality, and diversity is weakly accounted for.
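The imbalance between frequent items and the tail can be made concrete with a few counts. The following is a minimal sketch on an invented list of MWE occurrences; the function names and the toy data are assumptions for illustration, not part of any PARSEME tool.

```python
from collections import Counter


def rank_frequency(occurrences):
    """Return (type, count) pairs sorted from most to least frequent."""
    return Counter(occurrences).most_common()


def rare_type_share(occurrences, max_count=1):
    """Fraction of types seen at most max_count times (the 'tail')."""
    counts = Counter(occurrences)
    rare = sum(1 for c in counts.values() if c <= max_count)
    return rare / len(counts)


def top_token_share(occurrences, k=1):
    """Fraction of all occurrences covered by the k most frequent types."""
    ranked = rank_frequency(occurrences)
    return sum(c for _, c in ranked[:k]) / len(occurrences)


# Invented toy data: two frequent types and three types seen once.
occurrences = (["take place"] * 6 + ["make sense"] * 3
               + ["casser sa pipe", "sortir du lot", "kick the bucket"])

print(rank_frequency(occurrences))
print(rare_type_share(occurrences))
print(top_token_share(occurrences))
```

In this toy list a single type covers half of the occurrences, while three of the five types occur only once. A system that handles only the top type well would still look strong on occurrence-level scores.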

To meet this challenge, our recent work (Lion-Bouton, 2021) is explicitly dedicated to quantifying diversity in MWE language resources. We have adapted measures of variety (number of types in a system), balance (evenness of the distribution of items across types) and disparity (differences between types), stemming notably from ecology and information theory (Morales et al., 2021), to MWE lexicons extracted automatically from annotated corpora.
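To give an idea of what such measures can look like, here is a minimal sketch of one common instantiation of each dimension. These are assumptions chosen for illustration (type count for variety, Shannon evenness for balance, mean pairwise Jaccard distance over component words for disparity) and are not necessarily the measures used in Lion-Bouton (2021).

```python
import math
from collections import Counter
from itertools import combinations


def variety(items):
    """Number of distinct types."""
    return len(set(items))


def balance(items):
    """Shannon evenness: entropy divided by its maximum, in [0, 1]."""
    counts = Counter(items)
    if len(counts) < 2:
        return 1.0  # convention chosen here for a single type
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


def jaccard_distance(a, b):
    """1 minus the word overlap between two expressions (toy distance)."""
    words_a, words_b = set(a.split()), set(b.split())
    return 1 - len(words_a & words_b) / len(words_a | words_b)


def disparity(items, distance=jaccard_distance):
    """Mean pairwise distance between distinct types."""
    pairs = list(combinations(sorted(set(items)), 2))
    if not pairs:
        return 0.0
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


lexicon = ["make sense", "make sense", "make do", "make do"]
print(variety(lexicon), balance(lexicon), disparity(lexicon))
```

The toy lexicon has two equally frequent types, so its balance is maximal, while its disparity is below 1 because the two expressions share the verb make.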


The objective of this internship is to apply the aforementioned MWE diversity measures to MWE-annotated corpora and MWE identification tools. More precisely, the following steps are to be undertaken:


- characterizing a corpus (annotated for morpho-syntax and MWEs) for variety, balance and disparity of the vocabulary (casser sa pipe, sortir du lot), morphological features (plural, future) and syntactic structures (verb-object, verb-prepositional-phrase) occurring in the MWEs contained therein
- developing methods of diversity-driven corpus split, over-sampling and augmentation
- designing evaluation scenarios for MWE identifiers so that diversity of the results is treated on par with global precision and recall
- applying these scenarios to the system results of edition 1.2 of the PARSEME shared task
- analysing the evaluation outcome and characterizing the MWE identifiers as to how well they account for MWE diversity
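One way to picture an evaluation scenario of this kind is to report a type-level score next to occurrence-level precision and recall. The sketch below is only an assumption about how that could be done, not the shared task's evaluation: each occurrence is a hypothetical (sentence id, MWE type) pair, and type recall is the fraction of gold types found at least once.

```python
def evaluate(gold, predicted):
    """Occurrence-level precision/recall plus recall over distinct gold types."""
    gold, predicted = set(gold), set(predicted)
    true_positives = gold & predicted
    precision = len(true_positives) / len(predicted) if predicted else 0.0
    recall = len(true_positives) / len(gold) if gold else 0.0
    gold_types = {mwe_type for _, mwe_type in gold}
    found_types = {mwe_type for _, mwe_type in true_positives}
    type_recall = len(found_types) / len(gold_types) if gold_types else 0.0
    return {"precision": precision, "recall": recall, "type_recall": type_recall}


# Invented toy data: the system finds the frequent type and misses the rare one.
gold = [(1, "take place"), (2, "take place"), (3, "take place"),
        (4, "casser sa pipe")]
predicted = [(1, "take place"), (2, "take place"), (3, "take place")]

print(evaluate(gold, predicted))
```

On this toy data precision and recall look comfortable, while type recall shows that half of the gold MWE types were never identified.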

Candidate's profile


- 2nd-year master student in computational linguistics, computer science or a related field; excellent 1st-year master or 3rd-year bachelor students will also be considered
- Interest in linguistics and familiarity with language technology
- Good programming skills, preferably in Python

Important dates


- Application deadline: 20 November 2021 (or until filled)
- Notification: 30 November 2021
- Position starts: late January 2022 (at earliest)
- Position ends: late July 2022

How to apply

Send your CV, a cover letter and a transcript of your bachelor and master grades to Adam Lion-Bouton <adam.lion-bouton at etu.univ-tours.fr>, Agata Savary <first.last at universite-paris-saclay.fr>, Emmanuel Schang <first.last at univ-orleans.fr> and Jean-Yves Antoine <jean-yves.antoine at univ-tours.fr>.



References

Baldwin, T. and Kim, S. N. (2010) Multiword Expressions, in Nitin Indurkhya and Fred J. Damerau (eds.) Handbook of Natural Language Processing, Second Edition, CRC Press, Boca Raton, USA, pp. 267-292.


Matthieu Constant, Gülşen Eryiğit, Johanna Monti, Lonneke van der Plas, Carlos Ramisch, Michael Rosner, and Amalia Todirascu. 2017. Multiword expression processing: A survey. Computational Linguistics, 43(4):837–892.


Adam Lion-Bouton (2021) Multi-criterion optimisation for multiword expression lexicon design promoting linguistic diversity, Technical report, University of Tours.


Morales P. L., Lamarche-Perrin R., Fournier-S’niehotta R., Poulain R., Tabourier L., Tarissan F. (2021) Measuring Diversity in Heterogeneous Information Networks, in Theoretical Computer Science, Elsevier.


Carlos Ramisch, Agata Savary, Bruno Guillaume, Jakub Waszczuk, Marie Candito, Ashwini Vaidya, Verginica Barbu Mititelu, Archna Bhatia, Uxoa Iñurrieta, Voula Giouli, Tunga Güngör, Menghan Jiang, Timm Lichte, Chaya Liebeskind, Johanna Monti, Sara Stymne, Abigail Walsh, Renata Ramisch, Hongzhi Xu (2020) Edition 1.2 of the PARSEME Shared Task on Semi-supervised Identification of Verbal Multiword Expressions, in the Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons (MWE-LEX 2020), 13 December 2020, Barcelona, Spain (online).


Agata Savary, Marie Candito, Verginica Barbu Mititelu, Eduard Bejček, Fabienne Cap, Slavomir Čéplö, Silvio Ricardo Cordeiro, Gülşen Eryiğit, Voula Giouli, Maarten van Gompel, Yaakov HaCohen-Kerner, Jolanta Kovalevskaitė, Simon Krek, Chaya Liebeskind, Johanna Monti, Carla Parra Escartín, Lonneke van der Plas, Behrang QasemiZadeh, Carlos Ramisch, Federico Sangati, Ivelina Stoyanova, Veronika Vincze (2018) "PARSEME multilingual corpus of verbal multiword expressions", in Stella Markantonatou, Carlos Ramisch, Agata Savary, Veronika Vincze (Eds.) "Multiword expressions at length and in depth: Extended papers from the MWE 2017 workshop", Language Science Press, Berlin, pp. 87-147.


Williams J. R., Lessard P. R., Desu S., Clark E. M., Bagrow J. P., Danforth C. M., Dodds P. S. (2015). Zipf’s law holds for phrases, not words. Scientific Reports, 5.
