[Corpora-List] Research internship: 3-6 months, Multiword Expressions: Generalizing over Unseen data (LISN, Orsay, France)

Koos Wilt kooswilt at gmail.com
Sat Oct 23 22:55:53 CEST 2021

I apologize for this incomplete post; for one thing, my FBK is down and the problem will not be fixed until after the weekend. I have developed a way of dealing with multiword expressions based on 'dependency'. Mutual information allows one to say whether terms are statistically dependent or not. This makes it possible to check whether SUBJECT, VERB and OBJECT entities are dependent, that is, form a real predicate, or are just randomly co-occurring entities. I think my approach is applicable to all multiword expressions. My Facebook name is "Koos Van Der Wilt", and I discuss this approach at some length there. Feel free to contact me after the weekend.
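
As a rough illustration of the general idea only (the details of the approach are not given in this post), pointwise mutual information (PMI) is one common way to score how strongly two terms depend on each other. The sketch below assumes a toy list of made-up verb-object pairs; the function name `pmi_scores` and the data are hypothetical and stand in for counts that would normally come from a parsed corpus.

```python
import math
from collections import Counter


def pmi_scores(pairs):
    """Return the PMI (in bits) of each distinct (x, y) pair in `pairs`."""
    pair_counts = Counter(pairs)
    x_counts = Counter(x for x, _ in pairs)
    y_counts = Counter(y for _, y in pairs)
    n = len(pairs)
    scores = {}
    for (x, y), c in pair_counts.items():
        # PMI = log2( p(x, y) / (p(x) * p(y)) ), with probabilities
        # estimated as relative frequencies over the n observed pairs.
        scores[(x, y)] = math.log2(c * n / (x_counts[x] * y_counts[y]))
    return scores


# Made-up verb-object observations, for illustration only.
pairs = [
    ("kick", "bucket"), ("kick", "bucket"),
    ("kick", "ball"),
    ("fill", "bucket"),
    ("see", "ball"), ("see", "bucket"),
    ("see", "house"), ("see", "house"),
]

scores = pmi_scores(pairs)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

A PMI near zero suggests the two terms co-occur about as often as independence would predict, a positive value suggests dependence, and a negative value suggests they avoid each other. With counts this small the estimates are unreliable, so this is a sketch of the computation and not evidence about any particular expression.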


On Sat 23 Oct 2021 at 21:14, Agata Savary <agata.savary at universite-paris-saclay.fr> wrote:

> How well can deep learning algorithms generalize over unseen data: A
> case study in multiword expression identification
>
> Master internship proposal, 2021-2022
>
> - Domain: natural language processing
> - Location: Université Paris-Saclay (LISN <https://www.lisn.upsaclay.fr/>),
>   Gif-sur-Yvette, France
> - Research teams: ILES <https://www.limsi.fr/en/research/iles> (Written
>   and Sign Language Processing) of the LISN; TALEP
>   <https://talep.lis-lab.fr/> (Written and Spoken Language Processing) of
>   the LIS
> - Supervisors:
>   - Agata Savary <http://www.info.univ-tours.fr/~savary/> (LISN)
>   - Carlos Ramisch <http://pageperso.lis-lab.fr/carlos.ramisch/> (LIS)
> - Funding: Université Paris-Saclay
> - Duration: 3-6 months
> - Remuneration: around 606€/month
>
> Motivation and context
>
> The aim of this internship is to boost applications in Natural Language
> Processing (NLP) by focusing on one of their major challenges: multiword
> expressions (MWEs). MWEs are groups of words which exhibit unpredictable
> properties (Baldwin & Kim, 2010). Most prominently, their meaning does not
> straightforwardly derive from the meanings of their components. For
> instance, faire ‘make/do’ and valoir ‘be worth something’ are verbs, while
> their combination yields a noun: faire-valoir ‘a stooge, a person who is
> used by somebody to do things that are unpleasant or dishonest’.
> Similarly, the meaning of casser sa pipe ‘to die’ (literally ‘to break
> one’s pipe’) cannot be straightforwardly deduced from the meanings of the
> individual components. Due to these properties, MWEs are very challenging
> in applications like machine translation, information retrieval, opinion
> mining, etc.
>
> A major task related to MWEs is to automatically identify their
> occurrences in running text (so as to provide more accurate
> representations to downstream applications). The PARSEME
> <https://gitlab.com/parseme/corpora/-/wikis/home> network has been
> addressing this task via a series of shared tasks on automatic
> identification of verbal MWEs
> <https://gitlab.com/parseme/corpora/-/wikis/home#shared-tasks>. Edition 1.1
> of the PARSEME shared task (in 2018) showed that identifying MWEs which
> have not been previously seen in the training corpus is critically hard.
> Edition 1.2 saw the advent of transformer-based language models (BERT),
> which brought substantial progress in MWE identification performance.
> Still, only modest progress was achieved in generalization over unseen
> data.
>
> Objectives
>
> The aim of this internship is to better understand the potential of
> transformer-based models in generalizing over unseen data in MWE
> identification. More precisely, we wish to:
>
> - analyze the results of edition 1.2
>   <https://gitlab.com/parseme/sharedtask-data/-/tree/master/1.2/system-results>
>   of the PARSEME shared task, in particular those related to unseen data
> - propose an error analysis methodology for MWEs which are and are not
>   correctly identified, and try to understand the reasons behind this
>   state of affairs
> - put forward recommendations for future enhancements of state-of-the-art
>   MWE identifiers
> - (depending on the candidate's profile and the length of the internship)
>   implement a prototype based on these recommendations
>
> Candidate's profile
>
> - 2nd-year master student in computational linguistics, computer science
>   or similar; excellent 1st-year master or 3rd-year bachelor students will
>   also be considered
> - Interest in linguistics and familiarity with language technology
> - Good programming skills, preferably in Python
>
> Important dates
>
> - Application deadline: 20 November 2021 (or until filled)
> - Notification: 30 November 2021
> - Position starts: late January 2022 (at earliest)
> - Position ends: around late July 2022 (or later)
>
> How to apply
>
> Send your CV and a transcript of your bachelor and master grades to Agata
> Savary <first.last at universite-paris-saclay.fr> and Carlos Ramisch
> <first.last at lis-lab.fr>.
>
> References
>
> - Baldwin, T. and Kim, S. N. (2010) Multiword Expressions
>   <https://people.eng.unimelb.edu.au/tbaldwin/pubs/handbook2009.pdf>, in
>   Nitin Indurkhya and Fred J. Damerau (eds.) Handbook of Natural Language
>   Processing, Second Edition, CRC Press, Boca Raton, USA, pp. 267-292.
> - Matthieu Constant, Gülşen Eryiğit, Johanna Monti, Lonneke van der Plas,
>   Carlos Ramisch, Michael Rosner, and Amalia Todirascu (2017) Multiword
>   expression processing: A survey
>   <https://www.mitpressjournals.org/doi/full/10.1162/COLI_a_00302>,
>   Computational Linguistics, 43(4):837–892.
> - Carlos Ramisch, Agata Savary, Bruno Guillaume, Jakub Waszczuk, Marie
>   Candito, Ashwini Vaidya, Verginica Barbu Mititelu, Archna Bhatia, Uxoa
>   Iñurrieta, Voula Giouli, Tunga Güngör, Menghan Jiang, Timm Lichte, Chaya
>   Liebeskind, Johanna Monti, Sara Stymne, Abigail Walsh, Renata Ramisch,
>   Hongzhi Xu (2020) Edition 1.2 of the PARSEME Shared Task on
>   Semi-supervised Identification of Verbal Multiword Expressions
>   <https://www.aclweb.org/anthology/2020.mwe-1.14/>, in the Proceedings of
>   the Joint Workshop on Multiword Expressions and Electronic Lexicons
>   (MWE-LEX 2020), 13 December 2020, Barcelona, Spain (online).
> - Agata Savary, Silvio Ricardo Cordeiro, Carlos Ramisch (2019) Without
>   lexicons, multiword expression identification will never fly: A position
>   statement <https://www.aclweb.org/anthology/papers/W/W19/W19-5110/>, in
>   the Proceedings of the Joint Workshop on Multiword Expressions and
>   WordNet (MWE-WN 2019), 2 August 2019, Florence, Italy.