Marco Dinarelli


LaTTiCe-CNRS
UMR 8094
Office 9
1 rue Maurice Arnoux
92120 Montrouge, France

Email:
marco [dot] dinarelli [at] ens [dot] fr
marco [dot] dinarelli [at] gmail [dot] com

                        Curriculum Vitae           LinkedIn profile


Latest news

2017 / 05 / 17:
Three short papers accepted at TALN 2017, the French natural language processing conference

2017 / 04 / 22:
Best Verifiability, Reproducibility, and Working Description award at the CICLing 2017 conference (International Conference on Intelligent Text Processing and Computational Linguistics)

Research interests

  • Automatic semantic content extraction
  • Machine Learning
  • Natural Language Processing (NLP)
  • Structured features design for NLP
  • Probabilistic models, in particular Neural Networks, Stochastic Finite State Machines (FSM), Conditional Random Fields (CRF), Support Vector Machines (SVM), probabilistic grammars

Research projects

  • ANR DEMOCRAT, January 2016 - December 2019
    DEscription et MOdélisation des Chaînes de Référence : outils pour l'Annotation de corpus (en diachronie et en langues comparées) et le Traitement automatique
    (Description and modelling of reference chains: tools for corpus annotation, in diachrony and across languages, and for automatic processing)

Previous projects

Activities

Supervising

Teaching

Research applications

Extended Named Entity Detection

Named Entity Detection is a well-known NLP task, used as a preliminary step to extract semantic information that is then exploited in more complex applications. Beyond simple named entity detection tasks like the CoNLL 2003 shared task, more complex named entity sets have been defined in recent years, e.g. the one described in (Sekine and Nobata, 2004). Despite the complexity of these entity sets, most of the named entity detection tasks defined in recent years can still be tackled, more or less, as sequence labeling tasks.
During the first part of my post-doc at LIMSI-CNRS, I worked on a new set of named entities defined within the Quaero project. This set is described in (Grouin et Al., 2011); its main difference with respect to previous entity sets is that entities have complex tree structures, where simple entity components are combined into complex, higher-level entities.
Given this tree structure, the task can no longer be tackled as sequence labeling, which makes it more difficult. The task is made even harder by the kind of data annotated with named entities: transcriptions of French broadcast data coming from several French and North-African radio channels.
In order to address these issues, after trying approaches coming from syntactic parsing without success, I used an approach combining the robustness of Conditional Random Fields (CRF) (Lafferty et Al., 2001) on sequence labeling tasks with the ability of syntactic parsing algorithms (e.g. (Charniak, 1997)) to generate tree structures from flat sequences effectively, even on noisy data.
My approach uses CRF models to tag words with simple entity components, while a Probabilistic Context-Free Grammar (PCFG) together with a chart-parsing algorithm reconstructs the whole entity tree. The advantage of this approach is that CRFs are particularly effective for sequence labeling and robust to noise, so they provide an accurate annotation even on noisy data like French broadcast news. Once the words are annotated with entity components, and since entity trees are far simpler than syntactic trees, even a simple model like a PCFG is effective for parsing entity trees.
This approach has been evaluated in the 2011 Quaero named entity detection evaluation campaign, ranking first by a large margin.
Details about this approach are described in (Dinarelli Rosset, IJCNLP 2011). More recent advances are published in (Dinarelli Rosset, EACL 2012), where several different tree structures are used to encode some context in the PCFG. The same approach has also been applied to OCR-ized data dating from 1890, after a preprocessing step described in detail in (Dinarelli Rosset, LREC 2012).
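
To make the cascade concrete, here is a minimal sketch of the second stage: rebuilding an entity tree over component tags with a toy PCFG and a Viterbi chart parser (via NLTK). The tag set, grammar and example words are invented placeholders rather than the actual Quaero entity set, and in the real system the component tags come from the trained CRF instead of being hard-coded.

    import nltk

    # Hypothetical component tags, as a trained CRF would assign them to words
    words      = ["Barack", "Obama", "president"]
    components = ["name_first", "name_last", "func"]

    # Toy PCFG over entity components: complete entities are trees over components
    grammar = nltk.PCFG.fromstring("""
        S        -> pers_ind func_ind [1.0]
        pers_ind -> FIRST LAST        [1.0]
        func_ind -> FUNC              [1.0]
        FIRST    -> 'name_first'      [1.0]
        LAST     -> 'name_last'       [1.0]
        FUNC     -> 'func'            [1.0]
    """)

    # Chart parsing reconstructs the most probable entity tree over the tag sequence
    parser = nltk.ViterbiParser(grammar)
    for tree in parser.parse(components):
        print(tree)  # leaves are component tags, aligned one-to-one with the words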

Spoken Dialog Systems

Spoken Dialog Systems (SDS) are speech applications allowing humans to engage in a dialog with a machine in order to solve a task.
During my Ph.D. I worked on the LUNA project SDS prototype; in particular, I designed the understanding module of the application. The main goal was to develop an evolution of a simple call-routing application in Italian, in the domain of hardware/software problem solving. The understanding module of the application integrates state-of-the-art Spoken Language Understanding models, complemented with a sentence classifier.
Once the system has identified the problem as belonging to one of 10 possible scenarios, it redirects the user to an operator able to provide further assistance.
For more details see (Dinarelli et Al., ICASSP 2010).
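
As a rough illustration of the call-routing idea, the sketch below maps a user utterance to one of a fixed set of problem scenarios with a sentence classifier; the scenario labels and training utterances are invented, and a TF-IDF plus linear SVM pipeline (scikit-learn) stands in for the models actually used in the LUNA prototype.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Invented training utterances and scenario labels (placeholders)
    train_utterances = [
        "my printer does not work",
        "I cannot connect to the network",
        "the software crashes on startup",
    ]
    train_scenarios = ["HARDWARE_PRINTER", "NETWORK", "SOFTWARE_CRASH"]

    # Sentence classifier: bag-of-words features + linear SVM
    classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
    classifier.fit(train_utterances, train_scenarios)

    # Route a new utterance to the operator handling the predicted scenario
    print(classifier.predict(["the printer is jammed again"])[0])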

Ontology-Based Spoken Language Understanding

From a computer science perspective, an ontology is a taxonomy of classes linked by relations. In a Spoken Language Understanding (SLU) context, classes are semantic classes, or concepts, and relations are semantic relations between concepts.
Beyond traditional ontology relations, e.g. "is-a" or "part-of", we defined some specific relations among concepts taken from the Italian corpus of spoken dialogs described in (Dinarelli et Al., EACL 2009b).
The corpus covers the domain of problem solving for hardware/software repair and has been used for the development and evaluation of Spoken Language Understanding systems (see e.g. (Dinarelli et Al., EACL 2009a)).
We used the ontology's semantic relations to assess semantic interpretation hypotheses generated by a baseline SLU system based on Stochastic Finite State Transducers, like the one described in (Dinarelli et Al., EACL 2009a). We chose the hypothesis most consistent with the Ontology Relatedness measure defined in (Quarteroni et Al., ASRU 2009).
Although the final results in terms of accuracy did not improve the state of the art, this idea received good feedback at the Interspeech 2009 conference and the ASRU 2009 workshop.
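
The hypothesis-selection idea can be sketched as follows; the ontology, the concept names and the relatedness score below are invented placeholders, the actual Ontology Relatedness measure being the one defined in (Quarteroni et Al., ASRU 2009).

    from itertools import combinations

    # Hypothetical ontology: pairs of concepts known to be semantically related
    RELATED = {
        frozenset({"HardwareComponent", "Problem"}),
        frozenset({"Problem", "Action"}),
    }

    def relatedness(hypothesis):
        """Fraction of concept pairs in the hypothesis that the ontology relates."""
        pairs = list(combinations(set(hypothesis), 2))
        if not pairs:
            return 0.0
        return sum(frozenset(p) in RELATED for p in pairs) / len(pairs)

    def select(hypotheses):
        """Pick the SLU hypothesis (a concept sequence) most consistent with the ontology."""
        return max(hypotheses, key=relatedness)

    print(select([["HardwareComponent", "Problem"], ["HardwareComponent", "Greeting"]]))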

Ph.D. Thesis

The topic of my Ph.D. was Spoken Language Understanding (SLU) models for Spoken Dialog Systems. The work focused on the integration of different SLU models using discriminative re-ranking algorithms (Collins, 2000).
Two models for hypothesis generation were used: Stochastic Finite State Transducers (SFST), encoding a semantic language model (Raymond et Al., 2006), and Conditional Random Fields (CRF) (Lafferty et Al., 2001). The re-ranking model was based on Support Vector Machines (Vapnik, 1998) with kernel methods, in particular String Kernels (Shawe-Taylor & Cristianini, 2004) and Tree Kernels (Collins & Duffy, 2001) (Moschitti, 2006).
New tree-structured features for kernels were designed with the aim of giving an effective representation of SLU hypotheses to the SVM (Dinarelli et Al., EMNLP 2009).
An important contribution to reranking is the hypothesis selection criterion: a heuristic providing a semantic inconsistency metric over hypotheses, which allows selecting the best hypotheses among those generated by the SFST or the CRF; for details see (Dinarelli et Al., SLT 2010), (Dinarelli Rosset, EMNLP 2011) and (Dinarelli et Al., IEEE 2012).
The joint models based on reranking were evaluated on 4 different corpora in 4 different languages: ATIS (English), MEDIA (French), and the Italian and Polish corpora acquired within the European project LUNA (see (Dinarelli et Al., EACL 2009b) for the Italian corpus). An exhaustive comparison with several state-of-the-art models was performed, showing the effectiveness of reranking models; all details are in my Ph.D. dissertation (Dinarelli, Ph.D. Dissertation 2010).
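
A minimal sketch of the reranking step is given below: a hand-weighted linear scorer over toy features stands in for the tree-kernel SVM rerankers of the thesis, and the n-best hypotheses (concept tag sequences) are invented.

    import numpy as np

    def features(hypothesis):
        """Toy feature vector for an SLU hypothesis given as a list of concept tags."""
        return np.array([
            len(hypothesis),        # hypothesis length
            hypothesis.count("O"),  # number of words left without a concept
            len(set(hypothesis)),   # concept variety
        ], dtype=float)

    # In the thesis the reranker is learned discriminatively from annotated
    # n-best lists; here the weights are set by hand purely for illustration.
    weights = np.array([0.1, -1.0, 0.5])

    def rerank(n_best):
        """Sort the n-best hypotheses by their discriminative score, best first."""
        return sorted(n_best, key=lambda h: weights @ features(h), reverse=True)

    n_best = [["O", "O", "arrival.city"], ["depart.city", "O", "arrival.city"]]
    print(rerank(n_best)[0])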

Master Degree Thesis

During my Master's thesis I studied, implemented and evaluated an application for data clustering and compression.
Data compression algorithms can be thought of as functions transforming data so as to reduce local redundancy. Redundancy is detected by the compression algorithm inside a window on the input data stream. Since redundancy detection is limited to this window, this can be a serious limitation when compressing relatively large amounts of data. Common data compression algorithms, like the Lempel-Ziv family used in the zip and gzip Linux tools, or algorithms using the Burrows-Wheeler Transform (BWT) like the bzip2 Linux tool, use a fixed-size window (e.g. the mutually exclusive options -1,...,-9 fix the window size to 100K,...,900K).
A possible way to improve compression performance is to increase the window size. Unfortunately, this solution also increases the compression time, which in the worst case cannot be bounded a priori.
The solution studied in the thesis takes the opposite point of view: instead of arbitrarily increasing the window size in order to detect redundancies far apart in the documents, we apply a fast data clustering algorithm that (possibly) puts similar sub-parts of documents close together, thereby increasing local data redundancy. After the clustering phase, data are compressed using a variable-size window algorithm, whose window size bound was determined empirically with a set of experiments using increasing window sizes.
The clustering phase is performed using min-wise independent linear permutations (Bohman, Cooper, Frieze 2000) to convert document sub-parts into feature vectors. These are then mapped into a one-dimensional real number space using Locality Sensitive Hashing (LSH) (Andoni, Indyk 2006). Exploiting LSH properties (similar vectors, and thus similar document sub-parts, are hashed close together on the real line), we simply re-sort document sub-parts by hash value, obtaining data that are possibly highly redundant locally. The final compression step is performed with an algorithm based on the BWT, provided by my advisor, Professor Paolo Ferragina.
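
The idea can be sketched as follows, with heavy simplifications: the similarity signature is a crude min-hash over k-grams rather than min-wise independent linear permutations plus LSH, bz2 stands in for the BWT-based compressor provided by my advisor, and the block permutation would of course have to be stored alongside the output to allow decompression.

    import bz2
    import zlib

    BLOCK = 4096  # size of the document sub-parts to be clustered

    def signature(block, k=8):
        """Crude similarity-preserving signature: minimum CRC over the block's k-grams."""
        grams = (block[i:i + k] for i in range(max(len(block) - k, 0) + 1))
        return min(zlib.crc32(g) for g in grams)

    def cluster_and_compress(data):
        """Reorder blocks so similar ones become adjacent, then compress (BWT-based bz2)."""
        blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
        reordered = b"".join(sorted(blocks, key=signature))
        return bz2.compress(reordered)

    # Synthetic input where similar blocks are initially interleaved, hence far apart
    data = b"".join((b"abcd" * 1024 if i % 2 else bytes(range(256)) * 16) for i in range(20))
    print(len(bz2.compress(data)), len(cluster_and_compress(data)))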

Bibliography

(Dinarelli et Al., IEEE 2012)
Marco Dinarelli, A. Moschitti, G. Riccardi
Discriminative Reranking for Spoken Language Understanding
IEEE Transactions on Audio, Speech, and Language Processing (TASLP), volume 20, issue 2, pages 526-539, 2012.

(Dinarelli Rosset, LREC 2012)
Marco Dinarelli, S. Rosset
Tree-Structured Named Entity Recognition on OCR Data: Analysis, Processing and Results
In Proceedings of the Language Resources and Evaluation Conference (LREC), Istanbul, Turkey, 2012.

(Dinarelli Rosset, EACL 2012)
Marco Dinarelli, S. Rosset
Tree Representations in Probabilistic Models for Extended Named Entity Detection
In Proceedings of the European chapter of the Association for Computational Linguistics (EACL), Avignon, France, 2012.

(Dinarelli Rosset, IJCNLP 2011)
Marco Dinarelli, S. Rosset
Models Cascade for Tree-Structured Named Entity Detection
In Proceedings of International Joint Conference on Natural Language Processing (IJCNLP), Chiang Mai, Thailand, 2011.

(Dinarelli Rosset, EMNLP 2011)
Marco Dinarelli, S. Rosset
Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding
In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), Edinburgh, U.K., 2011.

(Dinarelli et Al., SLT 2010)
Marco Dinarelli, A. Moschitti, G. Riccardi
Hypotheses Selection For Re-ranking Semantic Annotations
IEEE Workshop on Spoken Language Technology (SLT), Berkeley, U.S.A., 2010.

(Dinarelli, Ph.D. Dissertation 2010)
Marco Dinarelli
Spoken Language Understanding: from Spoken Utterances to Semantic Structures
Ph.D. Dissertation, University of Trento
Department of Information Engineering and Computer Science (DISI), Italy, 2010.

(Dinarelli et Al., ICASSP 2010)
Marco Dinarelli, E. Stepanov, S. Varges, G. Riccardi
The LUNA Spoken Dialog System: Beyond Utterance Classification
In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Dallas, USA, 2010.

(Dinarelli et Al., EMNLP 2009)
Marco Dinarelli, A. Moschitti, G. Riccardi
Reranking Models Based On Small Training Data For Spoken Language Understanding
In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), Singapore, 2009.

(Dinarelli et Al., EACL 2009a)
Marco Dinarelli, A. Moschitti, G. Riccardi
Reranking Models for Spoken Language Understanding
In Proceedings of the European chapter of the Association for Computational Linguistics (EACL), Athens, Greece, 2009.

(Dinarelli et Al., EACL 2009b)
Marco Dinarelli, S. Quarteroni, S. Tonelli, A. Moschitti, G. Riccardi
Annotating Spoken Dialogs: from Speech Segments to Dialog Acts and Frame Semantics
EACL Workshop on Semantic Representation of Spoken Language, Athens, Greece, 2009.

(Quarteroni et Al., ASRU 2009)
S. Quarteroni, Marco Dinarelli, G. Riccardi
Ontology-Based Grounding Of Spoken Language Understanding
IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Merano, Italy, 2009.