PhD position in Knowledge extraction from semi-structured documents —
enrichment of DBpedia in French


We are seeking a candidate for a PhD position in the context of a
collaboration between the MELODI (http://www.irit.fr/-Equipe-MELODI-) team
of the Research Institute in Informatics of Toulouse (IRIT, CNRS UMR 5505)
and the CLLE-ERSS (http://w3.erss.univ-tlse2.fr/) team of the Cognition,
Languages, Ergonomics laboratory (CLLE, UMR 5263 CNRS). These laboratories
are among the strongest research centres in France in Informatics and
Linguistics, respectively. The teams have been collaborating for 20
years and are recognized experts in natural language processing, linguistic
analysis of corpora, and knowledge engineering. One of their research areas
concerns the linguistic characterisation of semantic relations in corpora
and the operationalisation of these characterisations in order to
facilitate the construction of knowledge models. Methods have been
developed for analyzing both written texts – using lexico-syntactic
patterns (Aussenac-Gilles and Jacques, 2008) or distributional analysis
(Fabre et al., 2014) – and text structure (Kamel et al., 2014). Methods
have also been proposed for integrating different fragments of knowledge
within a single model, by means of ontology alignments (Euzenat et al.,
2013). Hence, this thesis aims to adapt and combine these methods and to
propose novel ones, with a special focus on enriching the Web of data. The
candidate will
be co-supervised by Cécile Fabre, Professor at the University of Toulouse 2,
and Mouna Kamel, Assistant Professor at IRIT. The thesis will be funded by
a « Communauté d'Universités et d'Établissements Toulouse – Région
Midi-Pyrénées » (COMUE-Région) project.


This thesis addresses the problem of building semantic resources from
semi-structured text. The attributes of the text layout, which organise the
text and contribute significantly to its semantics, are underexploited by
most classical Natural Language Processing (NLP) methods. The first aim of
this thesis is to study the interaction between visual structure and
discourse analysis, and thus to specify how the analysis of natural
language and the analysis of text structure can be combined.
The second aim is to evaluate the contribution of linguistic information
within automated processes for the identification of semantic relations,
and for their integration into a knowledge model.

The theoretical results will help to develop different knowledge
extractors (in particular, semantic relation extractors) from
semi-structured texts in French, in order to enrich a knowledge base. Each
extractor will apply one particular technique (whether or not inspired by
the methods developed by the teams) and will exploit the different properties
(content and structure) of these texts. The experimental scenario will
concern the enrichment of the French DBpedia resource
(http://fr.dbpedia.org/), by better exploiting the properties of the
Wikipedia pages within the knowledge extraction process. These pages are
semi-structured and rich in knowledge expressing concepts (domain-specific
or general), relations, and rules associating them and giving them meaning.
However, like the English DBpedia, this resource is currently
constructed from very specific structured data (infoboxes, categories, links,
etc.) from Wikipedia pages.




We are looking for a candidate with an MSc in Computer Engineering/Science
or an adjacent field. The candidate must have taken courses in natural
language processing. She/he is required to have an interest in both
linguistic (corpus analysis, study and description of linguistic phenomena,
etc.) and statistical aspects that will allow her/him to develop
learning-based approaches and distributional analysis techniques. Interest
in the Semantic Web in general, and ontologies in particular, would also be
an asset.

Language requirements:

The student must be fluent in French and have a very good level of English.

Educational level:

Master's degree
