Ultra2Speech - A Deep Learning Framework for Formant Frequency Estimation and Tracking from Ultrasound Tongue Images

Pramit Saha, Yadong Liu, Bryan Gick, Sidney S. Fels
International Conference on Medical Image Computing and Computer-Assisted Intervention

Thousands of individuals need surgical removal of their larynx every year due to critical diseases and therefore require an alternative form of communication to articulate speech sounds after the loss of their voice box. This work addresses the articulatory-to-acoustic mapping problem based on ultrasound (US) tongue images for the development of a silent-speech interface (SSI) that can assist them in their daily interactions. Our approach targets automatically extracting…

Vocal tract area function extraction using ultrasound for articulatory speech synthesis

This paper studies the feasibility of an articulatory speech synthesizer by extracting the mid-sagittal tongue and palate contours using the ultrasound (US) imaging modality. The extracted contours

Reconstructing Speech from Real-Time Articulatory MRI Using Neural Vocoders

The results indicate that the approach can successfully reconstruct the gross spectral shape of the speech signal from a real-time MRI recording, but more improvements are needed to reproduce the fine spectral details.

Voice Activity Detection for Ultrasound-based Silent Speech Interfaces using Convolutional Neural Networks

A convolutional neural network classifier is trained to separate silent and speech-containing ultrasound tongue images, using a conventional VAD algorithm to create the training labels from the corresponding speech signal.
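The labelling step described above can be sketched as follows. The summary does not say which conventional VAD algorithm was used, so this sketch substitutes a simple short-time energy threshold as a stand-in. The function name `energy_vad_labels`, the -35 dB threshold, and the 80 frames-per-second ultrasound rate are all assumptions made for illustration, not details from the paper.

```python
import numpy as np

def energy_vad_labels(audio, sample_rate, frame_rate, threshold_db=-35.0):
    """Return one 0/1 label per ultrasound frame (1 = speech, 0 = silence).

    Assumption: a plain energy threshold stands in for the unspecified
    "conventional VAD algorithm" mentioned in the summary.
    """
    audio = np.asarray(audio, dtype=np.float64)
    # Number of audio samples that span one ultrasound frame.
    hop = int(round(sample_rate / frame_rate))
    n_frames = len(audio) // hop
    frames = audio[: n_frames * hop].reshape(n_frames, hop)
    # Short-time energy of each window, in dB relative to the loudest window.
    energy = np.mean(frames ** 2, axis=1)
    ref = max(energy.max(), 1e-12)
    energy_db = 10.0 * np.log10(np.maximum(energy, 1e-12) / ref)
    return (energy_db > threshold_db).astype(np.int64)

# Hypothetical recording: one second of silence followed by one second of a tone.
sr, fps = 16000, 80
t = np.arange(sr) / sr
audio = np.concatenate([np.zeros(sr), 0.5 * np.sin(2 * np.pi * 220 * t)])
labels = energy_vad_labels(audio, sr, fps)
```

Each label would then be paired with the ultrasound image recorded in the same time window to form a training example for the classifier.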

Improved Processing of Ultrasound Tongue Videos by Combining ConvLSTM and 3D Convolutional Networks

This paper experimentally compared various combinations of the above layer types for a silent speech interface task, and obtained the best result with a hybrid model that consists of a combination of 3D-CNN and ConvLSTM layers.

Learning Joint Articulatory-Acoustic Representations with Normalizing Flows

This paper aims at finding a joint latent representation between the articulatory and acoustic domain for vowel sounds via invertible neural network models, while simultaneously preserving the respective domain-specific features through convolutional autoencoder architecture and normalizing flow-based models.

Improving Neural Silent Speech Interface Models by Adversarial Training

The results indicate that the application of the adversarial training loss brings about a slight, but consistent improvement in all these metrics.



Towards Automatic Speech Identification from Vocal Tract Shape Dynamics in Real-time MRI

Effective video action recognition techniques are employed to identify different vowel-consonant-vowel (VCV) sequences from dynamic shaping of the vocal tract from a database consisting of 2D real-time MRI of vocal tract shaping during VCV utterances by 17 speakers.

DNN-Based Ultrasound-to-Speech Conversion for a Silent Speech Interface

It is found that the representation that used several neighboring image frames in combination with a feature selection method was preferred both by the subjects taking part in the listening experiments, and in terms of the Normalized Mean Squared Error.
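For reference, the Normalized Mean Squared Error can be computed as in the sketch below. The summary does not give the exact normalization, so this assumes one common definition: the mean squared error divided by the variance of the target values. The function name `nmse` is illustrative.

```python
import numpy as np

def nmse(target, predicted):
    """Mean squared error divided by the variance of the target.

    Assumption: this normalization is one common convention; the paper's
    exact definition is not stated in the summary.
    """
    target = np.asarray(target, dtype=np.float64)
    predicted = np.asarray(predicted, dtype=np.float64)
    mse = np.mean((target - predicted) ** 2)
    return mse / np.var(target)

# Hypothetical values, for illustration only.
target = np.array([1.0, 2.0, 3.0, 4.0])
perfect_score = nmse(target, target)
baseline_score = nmse(target, np.full(4, target.mean()))
```

Under this definition a perfect prediction scores 0, and always predicting the target mean scores 1, so values below 1 indicate a model that does better than that trivial baseline.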

Eigentongue Feature Extraction for an Ultrasound-Based Silent Speech Interface

  • T. Hueber, G. Aversano, M. Stone
  • 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07
  • 2007
The article compares two approaches to the description of ultrasound vocal tract images for application in a "silent speech interface," one based on tongue contour modeling, and a second, global

Autoencoder-Based Articulatory-to-Acoustic Mapping for Ultrasound Silent Speech Interfaces

The proposed method trains an autoencoder neural network on the ultrasound image, and can utilize several consecutive ultrasound images during estimation without a significant increase in the network size, while significantly increasing the accuracy of parameter estimation.

Multi-Task Learning of Speech Recognition and Speech Synthesis Parameters for Ultrasound-based Silent Speech Interfaces

The results show that the parallel learning of the two types of targets is indeed beneficial for both tasks, and improvements are obtained by using multi-task training of deep neural networks as a weight initialization step before task-specific training.

Restoring speech following total removal of the larynx by a learned transformation from sensor data to acoustics.

It is shown that it may be possible to restore speech by sensing movement of the remaining speech articulators and using machine learning algorithms to derive a transformation to convert this sensor data into an acoustic signal.

Speech synthesis from real time ultrasound images of the tongue

  • B. Denby, M. Stone
  • 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing
  • 2004
A machine learning technique is used to match reconstructed tongue contours in 30 frame per second ultrasound images to speaker vocal tract parameters obtained from a synchronized audio track. Speech

Ultrasound-Based Silent Speech Interface Using Convolutional and Recurrent Neural Networks

A deep neural network-based SSI is proposed that uses ultrasound images of the tongue as input signals and spectral coefficients of a vocoder as target parameters, and it is reported to give the best objective and subjective results.

Improving deep neural networks for LVCSR using rectified linear units and dropout

Modelling deep neural networks with rectified linear unit (ReLU) non-linearities with minimal human hyper-parameter tuning on a 50-hour English Broadcast News task shows a 4.2% relative improvement over a DNN trained with sigmoid units, and a 14.4% relative improvement over a strong GMM/HMM system.

Formant Estimation and Tracking

The formal task of formant tracking is described in detail, its successes and difficulties are explored, and reasons for the various approaches are given.