[Corpora-List] 3 years PhD funding

Isabelle Tellier isabelle.tellier at univ-paris3.fr
Sun Jun 12 19:22:23 CEST 2016

Ph.D. thesis on automatic coreference chain detection for French, combining machine learning approaches and linguistic resources

Supervisors: Isabelle Tellier, Marco Dinarelli (Lattice), and Eric de la Clergerie (Alpage)
Funding: Labex EFL (http://www.labex-efl.org/?q=en) in Paris (3 years)
Start: October 2016


A coreference chain is the set of expressions in a text that refer to the same discourse entity (or event). Coreference chains ensure the continuity and coherence of discourse. Detecting them is important for several tasks, such as Information Extraction and Machine Translation. Automatic Coreference Chain Detection (ACCD henceforth) is a well-known NLP task. It is fundamentally difficult: the "Winograd challenge", which consists in identifying the antecedent of a pronoun (one part of the task), has even been proposed as an alternative to the Turing test.

ACCD tasks have been proposed in several competitive international challenges (SemEval-2 in 2010, CoNLL in 2011 and 2012). However, there has been no such challenge for French, as no French corpus annotated with coreference was available until 2014. The French ANCOR corpus (for "Anaphore et Coréférence dans les corpus Oraux", i.e. Anaphora and Coreference in Speech Corpora), developed at the University of Tours (France), has partially solved this problem (see [Lefèvre et al. 2014], in French). Preliminary experiments have been carried out to explore this corpus [Desoyer et al. 2015] (in French). The corpus has some specificities due to the nature of speech transcriptions, and so far there is no complete (end-to-end) system for ACCD in French. Our project aims to fill this gap.

The goal of the Ph.D. thesis is to build an end-to-end system for coreference chain detection in French; that is, the system must be able to extract coreference chains from raw French text. Moreover, the system must be able to integrate with the ALPAGE NLP framework. Several problems must be tackled, in particular:

1) Identification of coreferent mentions in raw texts: we will assume that referential mentions are always named entities, pronouns or noun phrases. However, pronouns and noun phrases are not always referential mentions. For example, "it" in "it rains" (in French as well as in English) is impersonal, so it is not a referential mention. The same holds for some non-referential NPs in expressions like "Chicken!". Two strategies are possible for detecting real mentions in texts: either detecting them directly, or applying shallow syntactic tagging first and then filtering out non-referential pronouns and noun phrases. Another difficulty is created by embedded mentions. For example, in the ANCOR corpus, "improvement of working conditions" contains three embedded mentions: "working", "working conditions" and "improvement of working conditions". Detecting all three mentions is not trivial and requires local syntactic analysis.

2) Clustering detected mentions into coreferential classes. Many approaches have been tested to solve this problem [Lassalle 2015] (in French). Some of them are rule-based; others recast the problem so that classical machine learning techniques can be applied. For example, a simple but effective approach consists in building all possible pairs of mentions and deciding whether each pair is coreferential or not [Soon et al. 2001, Ng and Cardie 2001].
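
The two steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the nested (label, children) tree format, the function names np_spans, toy_pair_decision and cluster_mentions, and the head-matching rule are all hypothetical, and none of them comes from ANCOR or from the ALPAGE framework. The first function lists embedded noun-phrase mentions from a local syntactic analysis; the second groups mentions into chains by taking the transitive closure of positive pairwise decisions, in the spirit of the mention-pair approach.

```python
from itertools import combinations


def np_spans(tree):
    """Return (tokens, mentions) for a nested (label, children) tree.

    Every node labelled "NP" yields one mention; outer mentions are
    listed before the mentions embedded inside them.
    """
    label, children = tree
    tokens, mentions = [], []
    for child in children:
        if isinstance(child, str):
            tokens.append(child)  # leaf token
        else:
            sub_tokens, sub_mentions = np_spans(child)
            tokens.extend(sub_tokens)
            mentions.extend(sub_mentions)
    if label == "NP":
        mentions.insert(0, " ".join(tokens))
    return tokens, mentions


def toy_pair_decision(m1, m2):
    """Hypothetical stand-in for a trained pairwise classifier:
    two mentions are linked when their head words match."""
    return m1["head"].lower() == m2["head"].lower()


def cluster_mentions(mentions, pair_decision):
    """Build chains as the transitive closure of positive pairs (union-find)."""
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        if pair_decision(mentions[i], mentions[j]):
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    chains = {}
    for i, mention in enumerate(mentions):
        chains.setdefault(find(i), []).append(mention["text"])
    return list(chains.values())


# Step 1: embedded mentions in "improvement of working conditions".
tree = ("NP", ["improvement",
               ("PP", ["of",
                       ("NP", [("NP", ["working"]), "conditions"])])])
_, embedded = np_spans(tree)

# Step 2: clustering four toy mentions (invented data, for illustration only).
toy_mentions = [
    {"text": "Marie Dupont", "head": "Dupont"},
    {"text": "the report", "head": "report"},
    {"text": "Mrs Dupont", "head": "Dupont"},
    {"text": "this report", "head": "report"},
]
chains = cluster_mentions(toy_mentions, toy_pair_decision)
```

Here embedded contains the three mentions of the example, and chains contains two chains of two mentions each. A real system would replace toy_pair_decision with a classifier trained on annotated data, and would also need the filtering of non-referential pronouns and noun phrases, which is not shown.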

The main difficulty in this Ph.D. thesis will be the scarcity of French corpora annotated with coreference chains. To overcome this problem, a wide range of approaches must be tested, combining machine learning methods (text classification, Conditional Random Fields, Neural Networks) with linguistic information (surface features extracted from multilingual data, but also features coming from lemmatization, distributional analysis, morphosyntactic tagging, chunking, syntactic analysis, semantic lexicons or discourse analysis…).

References

(Haghighi and Klein 2009) A. Haghighi, D. Klein, Simple Coreference Resolution with Rich Syntactic and Semantic Features, EMNLP'09.

(Desoyer et al. 2015) A. Desoyer, F. Landragin, I. Tellier, A. Lefeuvre, J-Y. Antoine, Les coréférences à l'oral : une expérience d'apprentissage automatique sur le corpus ANCOR, TAL journal, issue 55.2 on spoken language processing, p. 97-121, 2015.

(Lassalle 2015) E. Lassalle, Structured Learning with Latent Trees: a joint approach to coreference resolution, Ph.D. thesis, Université Paris Diderot, 2015.

(Lee et al. 2013) H. Lee, A. Chang, Y. Peirsman, N. Chambers, M. Surdeanu, D. Jurafsky, Deterministic Coreference Resolution Based on Entity-Centric, Precision-Ranked Rules, Computational Linguistics, vol. 39, no. 4, p. 885-916, 2013.

(Lefèvre et al. 2014) A. Lefeuvre, J-Y. Antoine, E. Schang, Le corpus ANCOR_Centre et son outil de requêtage : application à l'étude de l'accord en genre et en nombre dans les coréférences et anaphores en français parlé, Actes du 4ème Congrès Mondial de Linguistique Française, 2014.

(Ng and Cardie 2001) V. Ng, C. Cardie, Improving Machine Learning Approaches to Coreference Resolution, Proceedings of ACL'02, p. 104-111, 2002.

(Soon et al. 2001) W. M. Soon, H. T. Ng, D. C. Y. Lim, A Machine Learning Approach to Coreference Resolution of Noun Phrases, Computational Linguistics, vol. 27, no. 4, p. 521-544, 2001.

How to apply

The Ph.D. thesis will take place at Lattice (Montrouge) and Alpage (Paris), and will be conducted in coordination with the French ANR project Democrat, led by Frederic Landragin at Lattice, which focuses on the same subject. Applicants must hold a Master's degree in mathematics or computer science (demonstrating their knowledge and skills in NLP and machine learning). Familiarity with French is highly desirable. Candidates must apply by sending a CV, a motivation letter and, if possible, their Master's exam scores to isabelle.tellier at univ-paris3.fr before 30 June 2016.
