[PhD Offer – 2019] Deep supervision of the vocal tract shape for articulatory synthesis of speech

Title: Deep supervision of the vocal tract shape for articulatory synthesis of speech

Team: MultiSpeech

Supervisor: Yves Laprie

Keywords: articulatory synthesis, real-time MRI images, articulatory modeling, deep learning


The production of speech requires a signal source, i.e. the vibration of the vocal folds or turbulence noise in the vocal tract, and a system of resonant cavities, i.e. the vocal tract. The speech articulators (jaw, tongue, lips, larynx, soft palate and epiglottis) modify the shape of the vocal tract, and therefore its acoustic properties, including its resonances. When learning speech or a second language, speakers learn how to mobilize and control the articulators to produce intelligible speech.

Articulatory synthesis mimics this process by taking as inputs the deformations of the vocal tract and the vocal fold control parameters over time. The interest of articulatory synthesis is that it can explain the articulatory origin of phonetic contrasts, allow the movement of the articulators to be changed (or even one of them to be blocked), modify the control parameters of the vocal folds, enable realistic adaptation to a new speaker by modifying the size and shape of the articulators, and finally give access to physical quantities in the vocal tract (e.g. pressure) without requiring the introduction of sensors into the vocal tract.

Compared to other synthesis approaches that offer a high level of quality, the strength of articulatory synthesis is above all that it controls the entire process of speech production.

The generation of the geometric shape of the vocal tract at each time point of the synthesis is most often based on an articulatory model [1,2] that gives the shape of the tract with a small number of parameters. Each parameter corresponds to a deformation mode of the articulator considered; the tongue, being the most deformable articulator, requires at least six parameters. The articulatory model is constructed from about 100 static MRI images of the vocal tract.
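As a purely illustrative sketch (synthetic contours, assumed sizes; not the model of [1,2]), a linear articulatory model of this kind can be obtained by principal component analysis of vocal tract contours, each deformation mode being one principal component:

```python
import numpy as np

# Hypothetical illustration: build a linear articulatory model by PCA.
# Each row is one vocal-tract contour; real data would come from the
# ~100 static MRI images, here we generate synthetic contours instead.
rng = np.random.default_rng(0)
n_images, n_points = 100, 60          # ~100 MRI images, 60 contour points (assumed)
base = np.sin(np.linspace(0, np.pi, n_points))
contours = base + 0.1 * rng.standard_normal((n_images, n_points))

mean = contours.mean(axis=0)
# SVD of the centred data yields the deformation modes (principal components).
_, s, vt = np.linalg.svd(contours - mean, full_matrices=False)

k = 6                                  # e.g. six parameters for the tongue
modes = vt[:k]                         # each row is one deformation mode
params = (contours - mean) @ modes.T   # low-dimensional articulatory parameters
reconstructed = mean + params @ modes  # shape recovered from only k parameters
print(params.shape)                    # (100, 6): six parameters per image
```

Controlling the synthesizer then amounts to driving these few parameters over time, each one deforming the contour along its own mode.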

Description of work

Recently, as part of a collaboration with the IADI laboratory (INSERM U1254) at Nancy hospital, we have been equipped with a two-dimensional real-time MRI acquisition system (55 images per second) for the vocal tract, together with a database of several hours of speech from one speaker.

The quality of these images of the mid-sagittal shape of the vocal tract is very good, so it is possible to track the contours of the articulators automatically [4,5,6]. We want to track each articulator independently of the others, because speech involves complex compensatory and coordinating gestures that would be hidden if the vocal tract were processed as a single piece [7].

The most important part of the work will be devoted to controlling the shape of the vocal tract. The idea is to develop a deep learning approach to determine the position of the articulators according to the phonemes to be articulated. The constraint is to be able to identify the role of each articulator in sufficient detail so as to be able to control its impact on the overall shape of the vocal tract, and to study coordination and compensation strategies between the articulators.
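As a hedged sketch of this idea (all data synthetic, all sizes assumed), a small neural network can map a one-hot phoneme context window to the articulatory parameters; keeping a distinct output dimension per parameter is what makes each articulator's contribution inspectable:

```python
import numpy as np

# Illustrative sketch: an MLP from a phoneme context window to articulatory
# parameters, trained by gradient descent on synthetic targets. Sizes
# (40 phonemes, 5-phoneme window, 6 parameters) are assumptions.
rng = np.random.default_rng(1)
n_phonemes, context, n_params = 40, 5, 6
d_in, d_hidden = n_phonemes * context, 64

# Synthetic training pairs: random phoneme windows -> random target params.
X = np.zeros((256, d_in))
idx = rng.integers(0, n_phonemes, (256, context))
for i, row in enumerate(idx):
    for j, ph in enumerate(row):
        X[i, j * n_phonemes + ph] = 1.0   # one-hot encoding per window slot
Y = rng.standard_normal((256, n_params))

W1 = rng.standard_normal((d_in, d_hidden)) * 0.05
b1 = np.zeros(d_hidden)
W2 = rng.standard_normal((d_hidden, n_params)) * 0.05
b2 = np.zeros(n_params)

losses = []
for step in range(200):                   # a few steps of gradient descent
    h = np.maximum(X @ W1 + b1, 0.0)      # ReLU hidden layer
    pred = h @ W2 + b2                    # one output per articulatory parameter
    err = pred - Y
    losses.append(float((err ** 2).mean()))
    # Backpropagation of the mean-squared-error loss.
    g_pred = 2 * err / len(X)
    g_W2, g_b2 = h.T @ g_pred, g_pred.sum(0)
    g_h = (g_pred @ W2.T) * (h > 0)
    g_W1, g_b1 = X.T @ g_h, g_h.sum(0)
    for p, g in ((W1, g_W1), (b1, g_b1), (W2, g_W2), (b2, g_b2)):
        p -= 0.1 * g
```

In the actual thesis the inputs would be phoneme sequences with timing and the targets the parameters tracked from the real-time MRI data; the sketch only shows the overall mapping and training loop.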

The abduction and adduction gestures of the vocal folds can be recorded using electro-photoglottography [8] and, like the articulatory parameters, they can be learned from the sequence of phonemes to be articulated. These two data streams will be fed into digital acoustic simulations [9] to verify the quality of the speech produced and to study the articulatory factors of expressive speech.


[1] B. J. Kröger, V. Graf-Borttscheller, A. Lowit (2008). Two- and Three-Dimensional Visual Articulatory Models for Pronunciation Training and for Treatment of Speech Disorders. Proc. of Interspeech 2008, Brisbane, Australia.

[2] Y. Laprie, J. Busset (2011). Construction and evaluation of an articulatory model of the vocal tract. In: 19th European Signal Processing Conference (EUSIPCO-2011), Barcelona, Spain.

[4] A. Jaumard-Hakoun, K. Xu, P. Roussel, G. Dreyfus, M. Stone and B. Denby (2015). Tongue contour extraction from ultrasound images based on deep neural network. Proc. of International Congress of Phonetic Sciences, Glasgow.

[5] I. Fasel and J. Berry (2010). Deep Belief Networks for Real-Time Extraction of Tongue Contours from Ultrasound During Speech. Proc. of 20th ICPR, Istanbul.

[6] G. Litjens, T. Kooi et al. (2017). A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60-88.

[7] A. J. Gully, T. Yoshimura, D. T. Murphy, K. Hashimoto, Y. Nankaku, and K. Tokuda (2017). Articulatory Text-to-Speech Synthesis using the Digital Waveguide Mesh driven by a Deep Neural Network. Interspeech, Stockholm.

[8] K. Honda and S. Maeda. (2008). Glottal-opening and airflow pattern during production of voiceless fricatives: A new non-invasive instrumentation. Journal of the Acoustical Society of America, 123(5):3788.

[9] B. Elie, Y. Laprie (2016). Extension of the single-matrix formulation of the vocal tract: consideration of bilateral channels and connection of self-oscillating models of the vocal folds with a glottal chink. Speech Communication 82, pp. 85–96.

Required skills

computer science, deep learning, automatic speech processing, applied mathematics
