Quantifying diversity of language phenomena in
corpora and system predictions: Case study of
multiword expressions
Master internship proposal, 2021-2022
* Domain: natural language processing (NLP)
* Location: Université Paris-Saclay (LISN lab), Gif-sur-Yvette, France;
  with visits to the University of Tours (LIFAT lab
  <https://lifat.univ-tours.fr/>) and the University of Orléans (LLL lab
  <https://www.univ-orleans.fr/fr/lll/le-laboratoire>)
* Research teams: ILES <https://www.limsi.fr/en/research/iles> (Written
  and Sign Language Processing) of the LISN; BdTln
  <https://lifat.univ-tours.fr/lifat-english-version/teams/bdtin>
  (Databases and Natural Language Processing) of the LIFAT; and DDL
  <https://lll.cnrs.fr/la-recherche/les-equipes/ddl/> (Language
  Description and Documentation) of the LLL
* Supervisors:
  o Adam LION-BOUTON (LIFAT)
  o Agata SAVARY <http://www.info.univ-tours.fr/~savary/> (LISN)
  o Emmanuel SCHANG <https://sites.google.com/site/emmanuelschang/> (LLL)
  o Jean-Yves ANTOINE <http://www.info.univ-tours.fr/~antoine/> (LIFAT)
* Funding: Université Paris-Saclay
* Duration: 3-6 months
* Remuneration: around 606 € / month
Motivation and context
Diversity of naturally occurring phenomena is a vital heritage to be preserved in the current progress- and optimization-driven globalization era. Diversity has been quantified in many domains (ecology, economy, information science, etc.), but less so in natural language processing (NLP). We address this aspect with respect to one particular linguistic phenomenon: multiword expressions (MWEs). MWEs are groups of words which exhibit unpredictable properties (Baldwin & Kim, 2010). Most prominently, their meaning does not straightforwardly derive from the meanings of their components. For instance, the meaning of casser sa pipe 'to die' (literally to break one's pipe) or of sortir du lot 'to be better than others' (literally to quit the batch) cannot be deduced from the meanings of the individual words. Due to these properties, MWEs are very challenging in applications like machine translation, information retrieval, opinion mining, etc.
Language resources dedicated to MWEs include MWE lexicons and MWE-annotated corpora (Savary et al., 2018), while a major computational task is to automatically identify MWEs in running text. The PARSEME network <https://gitlab.com/parseme/corpora/-/wikis/home> has been addressing the MWE identification task via a series of shared tasks on automatic identification of verbal MWEs <https://gitlab.com/parseme/corpora/-/wikis/home#shared-tasks> (Ramisch et al., 2020).
MWEs, like most other phenomena in human language, follow the so-called Zipf's law (Williams et al., 2015): few items are frequent and there is a long tail of rare ones. These few frequent items tend to be less diverse than the numerous items in the "Zipfian tail". Current models, including those for MWE identification, often favour the former and underperform on the latter. Hence, global scores overestimate quality, and diversity is poorly accounted for.
To meet this challenge, our recent work (Lion-Bouton, 2021) is explicitly dedicated to quantifying diversity in MWE language resources. We have adapted measures of variety (number of types in a system), balance (equity of items in various types) and disparity (differences between types), stemming notably from ecology and information theory (Morales et al., 2021), to MWE lexicons extracted automatically from annotated corpora.
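To make the three notions concrete, here is a minimal Python sketch. The specific formulas are assumptions chosen for illustration (richness for variety, Shannon evenness for balance, mean pairwise Jaccard distance over component words for disparity); they are not necessarily the measures used in Lion-Bouton (2021). All function names are hypothetical, and "casser la croûte" is an extra expression added only to make the example non-trivial.

```python
import math
from collections import Counter
from itertools import combinations


def variety(counts):
    """Variety: number of distinct MWE types observed at least once."""
    return sum(1 for c in counts.values() if c > 0)


def balance(counts):
    """Balance as Shannon evenness in [0, 1] (an illustrative choice):
    entropy of the type distribution divided by its maximum, log(variety)."""
    total = sum(counts.values())
    n = variety(counts)
    if n <= 1:
        return 0.0  # convention adopted here for degenerate cases
    entropy = -sum(
        (c / total) * math.log(c / total) for c in counts.values() if c > 0
    )
    return entropy / math.log(n)


def jaccard_distance(type_a, type_b):
    """Toy distance between two MWE types: 1 minus the word overlap ratio."""
    words_a, words_b = set(type_a.split()), set(type_b.split())
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)


def disparity(types, distance):
    """Disparity as the mean pairwise distance between types
    (an illustrative choice)."""
    pairs = list(combinations(types, 2))
    if not pairs:
        return 0.0
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical list of annotated MWE occurrences in a corpus.
occurrences = (
    ["casser sa pipe"] * 3 + ["sortir du lot"] + ["casser la croûte"]
)
counts = Counter(occurrences)
print("variety:", variety(counts))
print("balance:", round(balance(counts), 3))
print("disparity:", round(disparity(list(counts), jaccard_distance), 3))
```

In this toy corpus the frequent type dominates, so balance is below 1 even though three types are present; a realistic disparity measure would likely rely on lemmas, morphological features or syntactic structure rather than surface words.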
Objectives
The objective of this internship is to apply the aforementioned MWE diversity measures to MWE-annotated corpora and MWE identification tools. More precisely, the following steps are to be undertaken:
* characterizing a corpus (annotated for morpho-syntax and MWEs) for
  variety, balance and disparity of the vocabulary (casser sa pipe,
  sortir du lot), morphological features (plural, future) and syntactic
  structures (verb-object, verb-prepositional-phrase) occurring in the
  MWEs contained therein
* developing methods of diversity-driven corpus split, over-sampling
  and augmentation
* designing evaluation scenarios for MWE identifiers so that diversity
  of the results is treated on par with global precision and recall
* applying these scenarios to the system results of edition 1.2
  <https://gitlab.com/parseme/sharedtask-data/-/tree/master/1.2/system-results>
  of the PARSEME shared task
* analysing the evaluation outcome and characterizing the MWE
  identifiers by how well they account for MWE diversity
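As an illustration of why diversity-aware evaluation matters, the Python sketch below contrasts recall computed over all gold occurrences with recall averaged per MWE type. Both functions and the data layout (pairs of sentence identifier and MWE type) are hypothetical and are not the official PARSEME shared task metrics; they only show how a system that handles frequent expressions well can still miss rare types entirely.

```python
from collections import defaultdict


def occurrence_recall(gold, predicted):
    """Share of gold MWE occurrences that the system found."""
    if not gold:
        return 0.0
    return sum(1 for occ in gold if occ in predicted) / len(gold)


def type_macro_recall(gold, predicted):
    """Average of per-type recalls, so every MWE type weighs the same
    regardless of how often it occurs."""
    found = defaultdict(int)
    total = defaultdict(int)
    for occ in gold:
        mwe_type = occ[1]
        total[mwe_type] += 1
        if occ in predicted:
            found[mwe_type] += 1
    if not total:
        return 0.0
    return sum(found[t] / total[t] for t in total) / len(total)


# Hypothetical gold annotations: (sentence id, MWE type).
gold = [
    (1, "casser sa pipe"),
    (2, "casser sa pipe"),
    (3, "casser sa pipe"),
    (4, "sortir du lot"),
]
# Hypothetical system output: only the frequent type is identified.
predicted = {(1, "casser sa pipe"), (2, "casser sa pipe"), (3, "casser sa pipe")}

print("occurrence recall:", occurrence_recall(gold, predicted))
print("type-level recall:", type_macro_recall(gold, predicted))
```

Here the occurrence-level score is three out of four, while the type-level score drops to one half because the rare expression is never found.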
Candidate's profile
* 2nd-year master's student in computational linguistics, computer
  science or a related field; excellent 1st-year master's or 3rd-year
  bachelor's students will also be considered
* Interest in linguistics and familiarity with language technology
* Good programming skills, preferably in Python
Important dates
* Application deadline: 20 November 2021 (or until filled)
* Notification: 30 November 2021
* Position starts: late January 2022 (at earliest)
* Position ends: late July 2022
How to apply
Send your CV, a cover letter and a transcript of your bachelor and
master grades to Adam Lion-Bouton
<adam.lion-bouton at etu.univ-tours.fr>, Agata Savary
<first.last at universite-paris-saclay.fr>, Emmanuel Schang
<first.last at univ-orleans.fr> and Jean-Yves Antoine
<jean-yves.antoine at univ-tours.fr>.
References
* Baldwin, T. and Kim, S. N. (2010) Multiword Expressions
  <https://people.eng.unimelb.edu.au/tbaldwin/pubs/handbook2009.pdf>,
  in Nitin Indurkhya and Fred J. Damerau (eds.) Handbook of Natural
  Language Processing, Second Edition, CRC Press, Boca Raton, USA,
  pp. 267-292.
* Matthieu Constant, Gülşen Eryiğit, Johanna Monti, Lonneke van der
  Plas, Carlos Ramisch, Michael Rosner, and Amalia Todirascu (2017)
  Multiword expression processing: A survey
  <https://www.mitpressjournals.org/doi/full/10.1162/COLI_a_00302>,
  Computational Linguistics, 43(4):837–892.
* Adam Lion-Bouton (2021) Multi-criterion optimisation for multiword
  expression lexicon design promoting linguistic diversity, Technical
  report, University of Tours.
* Morales P. L., Lamarche-Perrin R., Fournier-S’niehotta R., Poulain R.,
  Tabourier L., Tarissan F. (2021) Measuring Diversity in Heterogeneous
  Information Networks
  <https://pedroramaciotti.github.io/files/publications/2021_TCS.pdf>,
  in Theoretical Computer Science, Elsevier.
* Carlos Ramisch, Agata Savary, Bruno Guillaume, Jakub Waszczuk, Marie
  Candito, Ashwini Vaidya, Verginica Barbu Mititelu, Archna Bhatia,
  Uxoa Iñurrieta, Voula Giouli, Tunga Güngör, Menghan Jiang, Timm
  Lichte, Chaya Liebeskind, Johanna Monti, Sara Stymne, Abigail Walsh,
  Renata Ramisch, Hongzhi Xu (2020) Edition 1.2 of the PARSEME Shared
  Task on Semi-supervised Identification of Verbal Multiword
  Expressions <https://www.aclweb.org/anthology/2020.mwe-1.14/>, in the
  Proceedings of the Joint Workshop on Multiword Expressions and
  Electronic Lexicons (MWE-LEX 2020), 13 December 2020, Barcelona,
  Spain (online).
* Agata Savary, Marie Candito, Verginica Barbu Mititelu, Eduard Bejček,
  Fabienne Cap, Slavomir Čéplö, Silvio Ricardo Cordeiro, Gülşen
  Eryiğit, Voula Giouli, Maarten van Gompel, Yaakov HaCohen-Kerner,
  Jolanta Kovalevskaitė, Simon Krek, Chaya Liebeskind, Johanna Monti,
  Carla Parra Escartín, Lonneke van der Plas, Behrang QasemiZadeh,
  Carlos Ramisch, Federico Sangati, Ivelina Stoyanova, Veronika Vincze
  (2018) PARSEME multilingual corpus of verbal multiword expressions
  <http://langsci-press.org/catalog/view/204/1344/1319-1>, in Stella
  Markantonatou, Carlos Ramisch, Agata Savary, Veronika Vincze (eds.)
  Multiword expressions at length and in depth: Extended papers from
  the MWE 2017 workshop, Language Science Press, Berlin, pp. 87-147.
* Williams J. R., Lessard P. R., Desu S., Clark E. M., Bagrow J. P.,
  Danforth C. M., Dodds P. S. (2015) Zipf’s law holds for phrases, not
  words, Scientific Reports, 5.