[Corpora-List] PhD position in knowledge extraction (Toulouse, France)

Cassia cassia.ts at gmail.com
Wed Jun 24 09:15:50 CEST 2015

PhD position in Knowledge extraction from semi-structured documents -- enrichment of DBpedia in French


We are seeking a candidate for a PhD position in the context of a collaboration between the MELODI (http://www.irit.fr/-Equipe-MELODI-) team of the Research Institute in Informatics of Toulouse (IRIT, CNRS UMR 5505) and the CLLE-ERSS (http://w3.erss.univ-tlse2.fr/) team of the Cognition, Languages, Ergonomics laboratory (CLLE, UMR 5263 CNRS). These laboratories are among the strongest research centres in France in informatics and linguistics, respectively. The teams have been collaborating for 20 years and are recognized experts in natural language processing, linguistic analysis of corpora, and knowledge engineering.

One of their research areas concerns the linguistic characterization of semantic relations in corpora and the operationalization of these characterizations in order to facilitate the construction of knowledge models. Methods have been developed for analyzing both written text, using lexico-syntactic patterns (Aussenac-Gilles and Jacques, 2008) or distributional analysis (Fabre et al., 2014), and text structure (Kamel et al., 2014). Methods have also been proposed for integrating different fragments of knowledge within a single model by means of ontology alignments (Euzenat et al., 2013). This thesis aims to adapt and combine these methods and to propose novel ones, with a special focus on enriching the Web of data.

The candidate will be co-supervised by Cécile Fabre, Professor at University of Toulouse 2, and Mouna Kamel, Assistant Professor at IRIT. The thesis will be funded as part of a « Communauté d’Universités et d’Établissements Toulouse – Région Midi-Pyrénées » (COMUE-Région) project.


This thesis addresses the problem of building semantic resources from semi-structured text. The attributes of the text layout, which organize the text and contribute significantly to its semantics, are underexploited by most classical Natural Language Processing (NLP) methods. The first aim of this thesis is to study the interaction between visual structure and discourse analysis, and thus to specify how the analysis of natural language and the analysis of text structure can be combined. The second aim is to evaluate the contribution of linguistic information within automated processes for identifying semantic relations and integrating them into a knowledge model.

The theoretical results will help to develop different knowledge extractors (in particular, semantic relation extractors) from semi-structured texts in French, in order to enrich a knowledge base. Each extractor will apply one particular technique (whether or not inspired by the methods developed by the teams) and will exploit the different properties (content and structure) of these texts. The experimental scenario will concern the enrichment of the French DBpedia resource (http://fr.dbpedia.org/) by better exploiting the properties of Wikipedia pages within the knowledge extraction process. These pages are semi-structured and rich in knowledge expressing concepts (domain-specific or general), relations, and rules associating them and giving them meaning. However, like the English DBpedia, this resource is currently constructed from very specific structured data (infoboxes, categories, links, etc.) in Wikipedia pages.


We are looking for a candidate with an MSc in Computer Engineering/Science or an adjacent field. The candidate must have taken courses in natural language processing. She/he is required to have an interest in both linguistic aspects (corpus analysis, study and description of linguistic phenomena, etc.) and statistical aspects, which will allow her/him to develop learning-based approaches and distributional analysis techniques. Interest in the Semantic Web in general, and ontologies in particular, would also be appreciated. The candidate must be fluent in French and have a very good command of English.


If you are interested in the above, please contact:

Cécile Fabre: cecile.fabre at univ-tlse2.fr

Mouna Kamel: mouna.kamel at irit.fr


(Aussenac-Gilles and Jacques, 2008) Aussenac-Gilles, N., Jacques, M.-P.: Designing and Evaluating Patterns for Relation Acquisition from Texts with Caméléon. Terminology 14(1), 145–173 (2008).

(Euzenat et al., 2013) Euzenat, J., Rosoiu, M., Trojahn dos Santos, C.: Ontology matching benchmarks: Generation, stability, and discriminability. Journal of Web Semantics 21, 30-48 (2013).

(Fabre et al., 2014) Fabre, C., Hathout, N., Ho-Dac, L. M., Morlane-Hondère, F., Muller, P., Sajous, F., Tanguy, L., Van de Cruys, T.: Présentation de l'atelier SemDis 2014: sémantique distributionnelle pour la substitution lexicale et l'exploration de corpus spécialisés. Actes de l'atelier SemDis 2014, 21e Conférence sur le Traitement Automatique des Langues Naturelles (TALN 2014), pp. 196-205 (2014).

(Kamel et al., 2014) Kamel, M., Rothenburger, B., Fauconnier, J-P.: Identification de relations sémantiques portées par les structures énumératives paradigmatiques : une approche symbolique et une approche par apprentissage supervisé. Revue d'Intelligence Artificielle, Hermès Science, Numéro spécial Ingénierie des Connaissances. Nouvelles évolutions, Vol. 28, No. 2-3, pp. 271-296 (2014).
