[PhD] Thesis in Natural Language Processing: Introduction of semantic information in a speech recognition system

Supervisors: Irina Illina, MdC, Dominique Fohr, CR CNRS

Team: Multispeech, LORIA-INRIA (https://team.inria.fr/multispeech/)

Contact: illina@loria.fr, dominique.fohr@loria.fr

Deadline to apply: May 15th, 2019

Required skills: Background in mathematics, machine learning (DNN), statistics and natural language processing, plus programming skills (Perl, Python).

English writing and speaking skills are required in any case.

Candidates should email a detailed CV together with a copy of their diploma.


LORIA is the French acronym for the “Lorraine Research Laboratory in Computer Science and its Applications”. It is a research unit (UMR 7503) common to CNRS, the University of Lorraine and INRIA, officially created in 1997. LORIA’s missions mainly deal with fundamental and applied research in computer science.

MULTISPEECH is a joint research team between the University of Lorraine, Inria, and CNRS. Its research focuses on speech processing, with particular emphasis on multisource (source separation, robust speech recognition), multilingual (computer-assisted language learning), and multimodal (audiovisual synthesis) aspects.

Context and objectives

In noisy conditions, audio acquisition is one of the toughest obstacles to successful automatic speech recognition (ASR). One possible approach is to attenuate the ambient noise in the signal and to take it into account in the acoustic model used by the ASR system. Our DNN (Deep Neural Network) denoising system and our approach to exploiting uncertainties have shown their combined effectiveness on noisy speech. To go further and improve the performance of automatic speech recognition in noisy conditions, we propose to use semantic or thematic information. Adding semantic information is expected to help resolve ambiguities caused by background noise.

Semantic and thematic spaces are vector spaces used to represent words, sentences or textual documents numerically. The corresponding models and methods have a long history in the field of computational linguistics and natural language processing [Turney and Pantel, 2010]. Almost all models rely on the statistical semantics hypothesis, which states that statistical patterns of word occurrence (the context of a word) can be used to describe the underlying semantics. The most widely used method for learning these representations is to predict a word from the context in which it appears [Mikolov et al., 2013b, Pennington et al., 2014], which can be done with neural networks. These representations have proved effective for a range of natural language processing tasks [Baroni et al., 2014]. In particular, the Skip-gram and CBOW models of Mikolov et al. [Mikolov et al., 2013b, Mikolov et al., 2013a] have become very popular because they can process large amounts of unstructured text data at low computational cost. The efficiency and the semantic properties of these representations motivate us to explore them for our task of speech recognition in noisy conditions.
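As an illustration of the word/context idea, the following minimal sketch builds the (target, context) pairs from which models such as Skip-gram and CBOW are trained. Skip-gram predicts each context word from the target word, while CBOW predicts the target from its context words. The function name `skipgram_pairs`, the window size and the toy sentence are assumptions made for this example only; they are not taken from the cited papers or from the team's software.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (target, context) pairs for each token and its neighbours.

    `window` is the maximum distance (in tokens) between a target word
    and a context word. Hypothetical helper, for illustration only.
    """
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Toy sentence (an assumption, not real training data).
pairs = skipgram_pairs(["noise", "degrades", "speech", "recognition"], window=1)
for target, context in pairs:
    print(target, "->", context)
```

A neural network trained on such pairs over a large corpus yields one dense vector per word, and words that occur in similar contexts end up close to each other in the resulting space.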


Main activities

This PhD thesis will study how to take semantics into account through predictive representations that capture the semantic features of words and their context. Research will focus on combining semantic information with information from denoising to improve speech recognition.

The ASR stage will be supplemented by a semantic analysis that detects the words of the processed sentence that could have been misrecognized and proposes acoustically similar words that better fit the context. Predictive representations using continuous vectors have been shown to capture the semantic characteristics of words and their context, and to outperform representations based on word counts. Semantic analysis will be performed by combining predictive representations using continuous vectors with information from denoising. This combination could be performed in the rescoring component. All our models will be based on the powerful DNN paradigm. The performance of the various modules will be evaluated on artificially noisy speech signals and on real noisy data.
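To make the rescoring idea concrete, here is a minimal sketch that re-ranks an N-best list by adding a semantic coherence term to each hypothesis score. It is only an illustration under stated assumptions, not the method that will be developed in the thesis: the two-dimensional embeddings, the ASR scores, the interpolation weight and the function names `semantic_coherence` and `rescore` are all invented for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def semantic_coherence(words: List[str], emb: Dict[str, Vector]) -> float:
    """Mean pairwise cosine similarity over the words that have an embedding."""
    vecs = [emb[w] for w in words if w in emb]
    if len(vecs) < 2:
        return 0.0
    sims = [cosine(vecs[i], vecs[j])
            for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return sum(sims) / len(sims)


def rescore(nbest: List[Tuple[List[str], float]],
            emb: Dict[str, Vector],
            weight: float = 1.0) -> List[Tuple[List[str], float]]:
    """Re-rank hypotheses by ASR score + weight * semantic coherence."""
    scored = [(words, asr + weight * semantic_coherence(words, emb))
              for words, asr in nbest]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hand-made toy embeddings and ASR log-scores (assumptions, not real data).
embeddings = {"ship": (1.0, 0.1), "sails": (0.9, 0.2), "sheep": (0.1, 1.0)}
nbest = [(["the", "sheep", "sails"], -4.0),   # acoustically preferred
         (["the", "ship", "sails"], -4.2)]    # semantically more coherent
best_words, best_score = rescore(nbest, embeddings, weight=1.0)[0]
print(" ".join(best_words))
```

In this toy case the acoustically confusable pair "sheep"/"ship" is disambiguated by the context word "sails". In a real system the weight would have to be tuned on held-out data, and the coherence term could be replaced by a DNN that also uses information from the denoising stage.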


[Baroni et al., 2014] Baroni, M., Dinu, G., and Kruszewski, G. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238-247, 2014.

[Mikolov et al., 2013a] Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space, CoRR, vol. abs/1301.3781, 2013.

[Mikolov et al., 2013b] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality, in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.

[Nathwani et al., 2018] Nathwani, K., Vincent, E., and Illina, I. DNN uncertainty propagation using GMM-derived uncertainty features for noise robust ASR, IEEE Signal Processing Letters, 2018.

[Nathwani et al., 2017] Nathwani, K., Vincent, E., and Illina, I. Consistent DNN uncertainty training and decoding for robust ASR, in Proc. IEEE Automatic Speech Recognition and Understanding Workshop, 2017.

[Nugraha et al., 2016] Nugraha, A., Liutkus, A., and Vincent, E. Multichannel audio source separation with deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016.

[Pennington et al., 2014] Pennington, J., Socher, R., and Manning, C. (2014). Glove: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.

[Peters et al., 2018a] Peters, M., Neumann, M., Zettlemoyer, L., and Yih, W. Dissecting Contextual Word Embeddings: Architecture and Representation, EMNLP, 2018.

[Peters et al., 2018b] Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. NAACL-HLT, 2018.

[Ruder, 2019] Ruder, S. Neural Transfer Learning for Natural Language Processing, PhD Thesis, National University of Ireland, Galway, 2019.

[Sheikh, 2016] Sheikh, I. Exploitation du contexte sémantique pour améliorer la reconnaissance des noms propres dans les documents audio diachroniques, PhD thesis in Computer Science, Université de Lorraine, 2016.

[Sheikh et al., 2016] Sheikh, I., Illina, I., Fohr, D., and Linares, G. Learning word importance with the neural bag-of-words model, in Proc. ACL Representation Learning for NLP (Repl4NLP) Workshop, Aug 2016.

[Turney and Pantel, 2010] Turney, P. D. and Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. J. Artif. Int. Res., 37(1):141-188.

