Marco Dinarelli


LaTTiCe-CNRS
UMR 8094
Office 9
1 rue Maurice Arnoux
92120 Montrouge, France

Email:
marco [dot] dinarelli [at] ens [dot] fr
marco [dot] dinarelli [at] gmail [dot] com

                        Curriculum Vitae           LinkedIn profile


Latest news

2017 / 05 / 17:
Three short papers accepted at TALN 2017, the French natural language processing conference

2017 / 04 / 22:
Best Verifiability, Reproducibility, and Working Description award at the CICLing 2017 conference (International Conference on Intelligent Text Processing and Computational Linguistics)

Research interests

  • Automatic semantic content extraction
  • Machine Learning
  • Natural Language Processing (NLP)
  • Structured features design for NLP
  • Probabilistic models, in particular Neural Networks, Stochastic Finite State Machines (FSM), Conditional Random Fields (CRF), Support Vector Machines (SVM), probabilistic grammars

Research projects

  • ANR DEMOCRAT, January 2016 - December 2019
    DEscription et MOdélisation des Chaînes de Référence : outils pour l'Annotation de corpus (en diachronie et en langues comparées) et le Traitement automatique
    (Description and modelling of reference chains: tools for corpus annotation, in diachrony and across languages, and for automatic processing)

Previous projects

Activities

Supervising

Teaching

Research applications

Extended Named Entity Detection

Named Entity Detection is a well-known NLP task, used as a preliminary step to extract semantic information that is then exploited in more complex applications. Beyond simple named entity detection tasks like the CoNLL 2003 shared task, more complex named entity sets have been defined in recent years, e.g. the one described in (Sekine and Nobata, 2004). Despite the complexity of these entity sets, most of the named entity detection tasks defined in recent years can still be tackled, more or less, as sequence labeling tasks.
During the first part of my post-doc at LIMSI-CNRS, I worked on a new set of named entities defined within the Quaero project. This set is described in (Grouin et Al., 2011); its main difference with respect to previous entity sets is that entities have complex tree structures, where simple entity components are combined into complex, higher-level entities.
Given this tree structure, the task can no longer be tackled as sequence labeling, which makes it more difficult. The task is made even harder by the kind of data annotated with named entities: transcriptions of French broadcast data coming from several French and North-African radio channels.
In order to address these issues, after trying approaches coming from syntactic parsing without success, I used an approach combining the robustness of Conditional Random Fields (CRF) (Lafferty et Al., 2001) on sequence labeling tasks with the ability of syntactic parsing algorithms (e.g. (Charniak, 1997)) to generate tree structures from flat sequences effectively, even on noisy data.
My approach uses CRF models to tag words with simple entity components, while a Probabilistic Context-Free Grammar (PCFG) together with a chart-parsing algorithm reconstructs the whole entity tree. The advantage of this approach is that CRFs are particularly effective for sequence labeling and robust to noise, so they provide an accurate annotation even on noisy data like French broadcast news. Once the words are annotated with entity components, and since entity trees are far simpler than syntactic trees, even a simple model like a PCFG is effective for parsing entity trees.
This approach has been evaluated in the 2011 Quaero named entity detection evaluation campaign, ranking first by a large margin.
Details about this approach are described in (Dinarelli Rosset, IJCNLP 2011). More recent advances are published in (Dinarelli Rosset, EACL 2012), where several different tree structures are used to encode some context in the PCFG. The same approach has also been applied to OCR-ized data dating from 1890, after a preprocessing step described in detail in (Dinarelli Rosset, LREC 2012).
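
To make the cascade concrete, here is a minimal sketch of the second stage: rebuilding an entity tree over component tags with a toy PCFG and a Viterbi chart parser (via NLTK). The tag set, grammar and example words are invented placeholders rather than the actual Quaero entity set, and in the real system the component tags come from the trained CRF instead of being hard-coded.

    import nltk

    # Hypothetical component tags, as a trained CRF would assign them to words
    words      = ["Barack", "Obama", "president"]
    components = ["name_first", "name_last", "func"]

    # Toy PCFG over entity components: complete entities are trees over components
    grammar = nltk.PCFG.fromstring("""
        S        -> pers_ind func_ind [1.0]
        pers_ind -> FIRST LAST        [1.0]
        func_ind -> FUNC              [1.0]
        FIRST    -> 'name_first'      [1.0]
        LAST     -> 'name_last'       [1.0]
        FUNC     -> 'func'            [1.0]
    """)

    # Chart parsing reconstructs the most probable entity tree over the tag sequence
    parser = nltk.ViterbiParser(grammar)
    for tree in parser.parse(components):
        print(tree)  # leaves are component tags, aligned one-to-one with the words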

Spoken Dialog Systems

Spoken Dialog Systems (SDS) are speech applications allowing humans to engage in a dialog with a machine in order to solve a task.
During my Ph.D. I worked on the LUNA project SDS prototype; in particular, I designed the understanding module of the application. The main goal was to develop an evolution of a simple call-routing application in Italian, in the domain of hardware/software problem solving. The understanding module of the application integrates state-of-the-art Spoken Language Understanding models, complemented with a sentence classifier.
Once the system has identified the problem as belonging to one of 10 possible scenarios, it redirects the user to an operator able to provide further assistance.
For more details see (Dinarelli et Al., ICASSP 2010).
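
As a rough illustration of the call-routing idea, the sketch below maps a user utterance to one of a fixed set of problem scenarios with a sentence classifier; the scenario labels and training utterances are invented, and a TF-IDF plus linear SVM pipeline (scikit-learn) stands in for the models actually used in the LUNA prototype.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Invented training utterances and scenario labels (placeholders)
    train_utterances = [
        "my printer does not work",
        "I cannot connect to the network",
        "the software crashes on startup",
    ]
    train_scenarios = ["HARDWARE_PRINTER", "NETWORK", "SOFTWARE_CRASH"]

    # Sentence classifier: bag-of-words features + linear SVM
    classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
    classifier.fit(train_utterances, train_scenarios)

    # Route a new utterance to the operator handling the predicted scenario
    print(classifier.predict(["the printer is jammed again"])[0])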

Ontology-Based Spoken Language Understanding

From a computer science perspective, an ontology is a taxonomy of classes linked by relations. In a Spoken Language Understanding (SLU) context, classes are semantic classes, or concepts, and relations are semantic relations between concepts.
Beyond traditional ontology relations, e.g. "is-a" or "part-of", we defined some specific relations among concepts taken from the Italian corpus of spoken dialogs described in (Dinarelli et Al., EACL 2009b).
The corpus covers the domain of problem solving for hardware/software repair and has been used for the development and evaluation of Spoken Language Understanding systems (see e.g. (Dinarelli et Al., EACL 2009a)).
We used the ontology's semantic relations to assess semantic interpretation hypotheses generated by a baseline SLU system based on Stochastic Finite State Transducers, like the one described in (Dinarelli et Al., EACL 2009a). We chose the hypothesis most consistent with the Ontology Relatedness measure defined in (Quarteroni et Al., ASRU 2009).
Although the final results in terms of accuracy did not improve the state of the art, this idea received good feedback at the Interspeech 2009 conference and the ASRU 2009 workshop.
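
The hypothesis-selection idea can be sketched as follows; the ontology, the concept names and the relatedness score below are invented placeholders, the actual Ontology Relatedness measure being the one defined in (Quarteroni et Al., ASRU 2009).

    from itertools import combinations

    # Hypothetical ontology: pairs of concepts known to be semantically related
    RELATED = {
        frozenset({"HardwareComponent", "Problem"}),
        frozenset({"Problem", "Action"}),
    }

    def relatedness(hypothesis):
        """Fraction of concept pairs in the hypothesis that the ontology relates."""
        pairs = list(combinations(set(hypothesis), 2))
        if not pairs:
            return 0.0
        return sum(frozenset(p) in RELATED for p in pairs) / len(pairs)

    def select(hypotheses):
        """Pick the SLU hypothesis (a concept sequence) most consistent with the ontology."""
        return max(hypotheses, key=relatedness)

    print(select([["HardwareComponent", "Problem"], ["HardwareComponent", "Greeting"]]))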

Ph.D. Thesis

The topic of my Ph.D. was Spoken Language Understanding (SLU) models for Spoken Dialog Systems. The work focused on the integration of different SLU models using discriminative re-ranking algorithms (Collins, 2000).
Two models for hypothesis generation were used: Stochastic Finite State Transducers (SFST), encoding a semantic language model (Raymond et Al., 2006), and Conditional Random Fields (CRF) (Lafferty et Al., 2001). The re-ranking model was based on Support Vector Machines (Vapnik, 1998) with kernel methods, in particular String Kernels (Shawe-Taylor & Cristianini, 2004) and Tree Kernels (Collins & Duffy, 2001) (Moschitti, 2006).
New tree-structured features for kernels were designed with the aim of giving an effective representation of SLU hypotheses to the SVM (Dinarelli et Al., EMNLP 2009).
An important contribution to reranking is the hypothesis selection criterion: a heuristic providing a semantic inconsistency metric over hypotheses, which allows selecting the best hypotheses among those generated by the SFST or the CRF; for details see (Dinarelli et Al., SLT 2010), (Dinarelli Rosset, EMNLP 2011) and (Dinarelli et Al., IEEE 2012).
The joint models based on reranking were evaluated on 4 different corpora in 4 different languages: ATIS (English), MEDIA (French), and the Italian and Polish corpora acquired within the European project LUNA (see (Dinarelli et Al., EACL 2009b) for the Italian corpus). An exhaustive comparison with several state-of-the-art models was performed, showing the effectiveness of reranking models; all details are in my Ph.D. dissertation (Dinarelli, Ph.D. Dissertation 2010).
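
A minimal sketch of the reranking step is given below: a hand-weighted linear scorer over toy features stands in for the tree-kernel SVM rerankers of the thesis, and the n-best hypotheses (concept tag sequences) are invented.

    import numpy as np

    def features(hypothesis):
        """Toy feature vector for an SLU hypothesis given as a list of concept tags."""
        return np.array([
            len(hypothesis),        # hypothesis length
            hypothesis.count("O"),  # number of words left without a concept
            len(set(hypothesis)),   # concept variety
        ], dtype=float)

    # In the thesis the reranker is learned discriminatively from annotated
    # n-best lists; here the weights are set by hand purely for illustration.
    weights = np.array([0.1, -1.0, 0.5])

    def rerank(n_best):
        """Sort the n-best hypotheses by their discriminative score, best first."""
        return sorted(n_best, key=lambda h: weights @ features(h), reverse=True)

    n_best = [["O", "O", "arrival.city"], ["depart.city", "O", "arrival.city"]]
    print(rerank(n_best)[0])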

Master Degree Thesis

During my Master's thesis I studied, implemented and evaluated an application for data clustering and compression.
Data compression algorithms can be thought of as functions transforming data so as to reduce local redundancy. Redundancy is detected by the compression algorithm inside a window on the input data stream. Since redundancy detection is limited to this window, this can be a serious limitation when compressing relatively large amounts of data. Common data compression algorithms, like the Lempel-Ziv family used in the zip and gzip Linux tools, or algorithms using the Burrows-Wheeler Transform (BWT) like the bzip2 Linux tool, use a fixed-size window (e.g. the mutually exclusive options -1,...,-9 fix the window size to 100K,...,900K).
A possible way to improve compression performance is to increase the window size. Unfortunately, this solution also increases the compression time, which in the worst case cannot be bounded a priori.
The solution studied in the thesis takes the opposite point of view: instead of arbitrarily increasing the window size in order to detect redundancies far apart in the documents, we apply a fast data clustering algorithm that (possibly) puts similar sub-parts of documents close together, thereby increasing local data redundancy. After the clustering phase, data are compressed using a variable-size window algorithm, whose window size bound was determined empirically with a set of experiments using increasing window sizes.
The clustering phase is performed using min-wise independent linear permutations (Bohman, Cooper, Frieze 2000) to convert document sub-parts into feature vectors. These are then mapped into a one-dimensional real number space using Locality Sensitive Hashing (LSH) (Andoni, Indyk 2006). Exploiting LSH properties (similar vectors, and thus similar document sub-parts, are hashed close together on the real line), we simply re-sort document sub-parts by hash value, obtaining data that are possibly highly redundant locally. The final compression step is performed with an algorithm based on the BWT, provided by my advisor, Professor Paolo Ferragina.
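
The idea can be sketched as follows, with heavy simplifications: the similarity signature is a crude min-hash over k-grams rather than min-wise independent linear permutations plus LSH, bz2 stands in for the BWT-based compressor provided by my advisor, and the block permutation would of course have to be stored alongside the output to allow decompression.

    import bz2
    import zlib

    BLOCK = 4096  # size of the document sub-parts to be clustered

    def signature(block, k=8):
        """Crude similarity-preserving signature: minimum CRC over the block's k-grams."""
        grams = (block[i:i + k] for i in range(max(len(block) - k, 0) + 1))
        return min(zlib.crc32(g) for g in grams)

    def cluster_and_compress(data):
        """Reorder blocks so similar ones become adjacent, then compress (BWT-based bz2)."""
        blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
        reordered = b"".join(sorted(blocks, key=signature))
        return bz2.compress(reordered)

    # Synthetic input where similar blocks are initially interleaved, hence far apart
    data = b"".join((b"abcd" * 1024 if i % 2 else bytes(range(256)) * 16) for i in range(20))
    print(len(bz2.compress(data)), len(cluster_and_compress(data)))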

Bibliography

(Dinarelli et Al., IEEE 2012)
Marco Dinarelli, A. Moschitti, G. Riccardi
Discriminative Reranking for Spoken Language Understanding
IEEE Transactions on Audio, Speech, and Language Processing (TASLP), volume 20, issue 2, pages 526-539, 2012.

(Dinarelli Rosset, LREC 2012)
Marco Dinarelli, S. Rosset
Tree-Structured Named Entity Recognition on OCR Data: Analysis, Processing and Results
In Proceedings of the Language Resources and Evaluation Conference (LREC), Istanbul, Turkey, 2012.

(Dinarelli Rosset, EACL 2012)
Marco Dinarelli, S. Rosset
Tree Representations in Probabilistic Models for Extended Named Entity Detection
In Proceedings of the European chapter of the Association for Computational Linguistics (EACL), Avignon, France, 2012.

(Dinarelli Rosset, IJCNLP 2011)
Marco Dinarelli, S. Rosset
Models Cascade for Tree-Structured Named Entity Detection
In Proceedings of International Joint Conference on Natural Language Processing (IJCNLP), Chiang Mai, Thailand, 2011.

(Dinarelli Rosset, EMNLP 2011)
Marco Dinarelli, S. Rosset
Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding
In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), Edinburgh, U.K., 2011.

(Dinarelli et Al., SLT 2010)
Marco Dinarelli, A. Moschitti, G. Riccardi
Hypotheses Selection For Re-ranking Semantic Annotations
IEEE Workshop on Spoken Language Technology (SLT), Berkeley, U.S.A., 2010.

(Dinarelli, Ph.D. Dissertation 2010)
Marco Dinarelli
Spoken Language Understanding: from Spoken Utterances to Semantic Structures
Ph.D. Dissertation, University of Trento
Department of Information Engineering and Computer Science (DISI), Italy, 2010.

(Dinarelli et Al., ICASSP 2010)
Marco Dinarelli, E. Stepanov, S. Varges, G. Riccardi
The LUNA Spoken Dialog System: Beyond Utterance Classification
In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Dallas, USA, 2010.

(Dinarelli et Al., EMNLP 2009)
Marco Dinarelli, A. Moschitti, G. Riccardi
Reranking Models Based On Small Training Data For Spoken Language Understanding
In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), Singapore, 2009.

(Dinarelli et Al., EACL 2009a)
Marco Dinarelli, A. Moschitti, G. Riccardi
Reranking Models for Spoken Language Understanding
In Proceedings of the European chapter of the Association for Computational Linguistics (EACL), Athens, Greece, 2009.

(Dinarelli et Al., EACL 2009b)
Marco Dinarelli, S. Quarteroni, S. Tonelli, A. Moschitti, G. Riccardi
Annotating Spoken Dialogs: from Speech Segments to Dialog Acts and Frame Semantics
EACL Workshop on Semantic Representation of Spoken Language, Athens, Greece, 2009.

(Quarteroni et Al., ASRU 2009)
S. Quarteroni, Marco Dinarelli, G. Riccardi
Ontology-Based Grounding Of Spoken Language Understanding
IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Merano, Italy, 2009.