The work of the post-doc fellow will fall within the framework of the area “Document
Classification”. The aim is to design innovative approaches for classifying documents
(according to their nature: invoice, quotation, bank account details, …) in multi-channel
incoming document flows, and to build a prototype based on these approaches.
This application context raises several scientific bottlenecks, mainly in the field
of machine learning:
– document classes are generally very imbalanced in existing corpora. Some classes
are very well represented in the training set (with many training documents), while others
are much less represented (if present at all). As a result, the approaches developed so far
yield very uneven accuracy rates across classes.
– the intra-class variability is very large (sometimes even greater than the inter-class
variability). An example of this is the fact that two documents from the same company but
from different classes (e.g. an invoice and a quotation) may be closer than two invoices from
different companies in the representation space, both visually and in terms of textual
content.
This post-doctoral work will start from a detailed review of the state of the art, in order to
identify the limits of existing approaches and to propose innovative ones that help overcome the
bottlenecks mentioned above. To solve these problems, we plan to use machine learning
techniques that, building on existing image and/or text content classification techniques, allow
us to:
– take both modalities (image and text) into account jointly for document classification
(multi-modality), in order to improve accuracy for most classes
– learn a class from very few (or even zero) examples (few-shot or zero-shot learning), possibly
on the fly, in the document flow
– effectively implement a rejection strategy when the document to be classified is too “far”
from existing classes, or when the ambiguity between classes is too great (with thresholds
that will be set automatically or semi-automatically, depending on the corpus).
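To make the rejection idea above concrete, here is a minimal sketch, not the method that will
be developed in the project. It assumes each class is summarised by a centroid in some
representation space and that two thresholds are given; the names classify_with_rejection,
max_distance and min_margin are hypothetical and chosen for illustration only.

```python
import math

def classify_with_rejection(embedding, centroids, max_distance, min_margin):
    """Return the nearest class label, or None if the document is rejected.

    Illustrative assumptions: `centroids` maps a class label to a point in the
    representation space, and both thresholds are supplied by the caller.
    """
    # Distance from the document to every class centroid, nearest first.
    ranked = sorted((math.dist(embedding, c), label) for label, c in centroids.items())
    best_dist, best_label = ranked[0]
    # Rejection case 1: the document is too "far" from every existing class.
    if best_dist > max_distance:
        return None
    # Rejection case 2: the two nearest classes are too close to tell apart.
    if len(ranked) > 1 and ranked[1][0] - best_dist < min_margin:
        return None
    return best_label

# Toy two-dimensional example with two hypothetical class centroids.
centroids = {"invoice": (0.0, 0.0), "quotation": (4.0, 0.0)}
label = classify_with_rejection((0.5, 0.0), centroids, max_distance=2.0, min_margin=1.0)
```

In this toy setting, a point midway between the two centroids is rejected as ambiguous, and a
point far from both is rejected as unknown. Choosing the two thresholds per corpus is precisely
the part that would be set automatically or semi-automatically.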
If the post-doc fellow would like to acquire or reinforce his/her experience of working in a
private company, short collaborative working stays can be arranged at the premises of the Yooz
company, in Aimargues (Mediterranean coast).
The candidate, who holds a Ph.D. in computer science, computer engineering,
signal processing, or applied mathematics, must have significant research experience in at
least two of the following areas:
– Machine learning / classification
– Image analysis
– Pattern recognition
Moreover, knowledge or experience of Natural Language Processing would be a plus.
The candidate’s skills will include:
– Mastery of one or more programming languages (Java, Python, C/C++…)
– Very good teamwork skills; knowledge or experience of Agile methods would be a
plus (as the work will be carried out in conjunction with both researchers from the L3i
laboratory and the R&D department of the Yooz company).
– Good scientific writing skills, and fluency in written and spoken English.