Ultra2Speech - A Deep Learning Framework for Formant Frequency Estimation and Tracking from Ultrasound Tongue Images
@inproceedings{Saha2020Ultra2SpeechA,
  title     = {Ultra2Speech - A Deep Learning Framework for Formant Frequency Estimation and Tracking from Ultrasound Tongue Images},
  author    = {Pramit Saha and Yadong Liu and Bryan Gick and Sidney S. Fels},
  booktitle = {International Conference on Medical Image Computing and Computer-Assisted Intervention},
  year      = {2020}
}
Every year, thousands of individuals require surgical removal of the larynx due to critical disease and are therefore left needing an alternative way to articulate speech sounds after the loss of their voice box. This work addresses the articulatory-to-acoustic mapping problem based on ultrasound (US) tongue images, toward a silent-speech interface (SSI) that can assist them in their daily interactions. Our approach targets automatically extracting…
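As a rough, hypothetical sketch of the kind of articulatory-to-acoustic mapping involved (not the architecture reported in the paper), a small convolutional regressor could map a single ultrasound tongue frame to the first few formant frequencies; the input size, layer widths, and plain MSE loss below are all assumptions.

```python
# Illustrative sketch only: a small CNN that regresses the first three formant
# frequencies (Hz) from one ultrasound tongue frame. Input size, layer widths,
# and the use of plain MSE are assumptions, not the paper's actual design.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(64, 128, 1)),              # one grayscale ultrasound frame
    layers.Conv2D(16, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(3),                                # predicted F1, F2, F3
])
model.compile(optimizer="adam", loss="mse")
```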
6 Citations
Vocal tract area function extraction using ultrasound for articulatory speech synthesis
- Physics · 11th ISCA Speech Synthesis Workshop (SSW 11)
- 2021
This paper studies the feasibility of an articulatory speech synthesizer by extracting the mid-sagittal tongue and palate contours using the ultrasound (US) imaging modality. The extracted contours…
Reconstructing Speech from Real-Time Articulatory MRI Using Neural Vocoders
- Computer Science · 2021 29th European Signal Processing Conference (EUSIPCO)
- 2021
The results indicate that the approach can successfully reconstruct the gross spectral shape of the speech signal from a real-time MRI recording, but more improvements are needed to reproduce the fine spectral details.
Voice Activity Detection for Ultrasound-based Silent Speech Interfaces using Convolutional Neural Networks
- Computer Science · TDS
- 2021
A convolutional neural network classifier is trained to separate silent and speech-containing ultrasound tongue images, using a conventional VAD algorithm to create the training labels from the corresponding speech signal.
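A minimal sketch of the setup described above, with the shapes and energy threshold assumed: a conventional short-time-energy VAD on the synchronized audio supplies the labels, and a small CNN learns to predict speech vs. silence from the ultrasound frames alone.

```python
# Sketch with assumed shapes/thresholds: energy-based VAD labels from the audio,
# CNN classifier on the ultrasound frames.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

def energy_vad_labels(audio_frames, threshold=1e-3):
    """Label each audio frame 1 (speech) or 0 (silence) by short-time energy."""
    energy = np.mean(audio_frames ** 2, axis=1)
    return (energy > threshold).astype("int32")

classifier = models.Sequential([
    layers.Input(shape=(64, 128, 1)),                     # one ultrasound frame
    layers.Conv2D(8, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(4),
    layers.Conv2D(16, 3, padding="same", activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),                # P(speech)
])
classifier.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Dummy training call: ultrasound frames paired with audio-derived labels.
images = np.random.rand(32, 64, 128, 1).astype("float32")
labels = energy_vad_labels(np.random.randn(32, 400).astype("float32"))
classifier.fit(images, labels, epochs=1, verbose=0)
```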
Improved Processing of Ultrasound Tongue Videos by Combining ConvLSTM and 3D Convolutional Networks
- Computer Science · IEA/AIE
- 2022
This paper experimentally compares various combinations of ConvLSTM and 3D convolutional layers for a silent speech interface task and obtains the best result with a hybrid model consisting of 3D-CNN and ConvLSTM layers.
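As a hypothetical illustration of such a hybrid (the paper's exact configuration is not reproduced here), a Keras model can apply 3D convolutions over a short clip of ultrasound frames and feed the result to a ConvLSTM layer; all shapes and widths below are assumed.

```python
# Rough sketch of a 3D-CNN + ConvLSTM hybrid for ultrasound clips; not the
# paper's reported model. Clip length, filter counts, and the 25-dimensional
# output (e.g. vocoder spectral parameters) are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(10, 64, 128, 1)),                  # 10-frame ultrasound clip
    layers.Conv3D(16, kernel_size=(3, 3, 3), padding="same", activation="relu"),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),               # pool only spatially
    layers.ConvLSTM2D(32, kernel_size=(3, 3), padding="same", return_sequences=False),
    layers.GlobalAveragePooling2D(),
    layers.Dense(25),                                        # assumed acoustic targets
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```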
Learning Joint Articulatory-Acoustic Representations with Normalizing Flows
- Physics, Computer Science · INTERSPEECH
- 2020
This paper aims to find a joint latent representation between the articulatory and acoustic domains for vowel sounds via invertible neural network models, while simultaneously preserving the respective domain-specific features through a convolutional autoencoder architecture and normalizing flow-based models.
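The invertible building block behind such flow models is typically a coupling layer; the numpy sketch below illustrates only the exact-invertibility property of an affine coupling, not the paper's joint articulatory-acoustic model, and the "networks" in it are fixed random stand-ins.

```python
# Minimal illustration of an affine coupling layer, the invertible building block
# used in normalizing flows. The "networks" s() and t() are fixed random linear
# maps here purely to demonstrate exact invertibility; in a real flow they are
# learned neural networks.
import numpy as np

rng = np.random.default_rng(0)
D = 8                                   # latent dimension (assumed)
Ws, Wt = 0.1 * rng.standard_normal((2, D // 2, D // 2))

def s(x):  return np.tanh(x @ Ws)       # scale network (stand-in)
def t(x):  return x @ Wt                # translation network (stand-in)

def forward(x):
    x1, x2 = x[: D // 2], x[D // 2 :]
    y2 = x2 * np.exp(s(x1)) + t(x1)     # transform second half conditioned on first
    return np.concatenate([x1, y2])

def inverse(y):
    y1, y2 = y[: D // 2], y[D // 2 :]
    x2 = (y2 - t(y1)) * np.exp(-s(y1))  # exact inverse, no approximation
    return np.concatenate([y1, x2])

x = rng.standard_normal(D)
assert np.allclose(inverse(forward(x)), x)   # the mapping is exactly invertible
```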
Improving Neural Silent Speech Interface Models by Adversarial Training
- Computer Science · AICV
- 2021
The results indicate that the application of the adversarial training loss brings about a slight, but consistent improvement in all these metrics.
References
Towards Automatic Speech Identification from Vocal Tract Shape Dynamics in Real-time MRI
- Computer Science, Physics · INTERSPEECH
- 2018
Effective video action recognition techniques are employed to identify different vowel-consonant-vowel (VCV) sequences from the dynamic shaping of the vocal tract, using a database of 2D real-time MRI recordings of vocal tract shaping during VCV utterances by 17 speakers.
DNN-Based Ultrasound-to-Speech Conversion for a Silent Speech Interface
- Computer Science · INTERSPEECH
- 2017
The representation that uses several neighboring image frames in combination with a feature selection method is found to be preferred both by the subjects taking part in the listening experiments and in terms of the Normalized Mean Squared Error.
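Using "several neighboring image frames" amounts to stacking a context window of frames into one input vector; a small sketch of that preprocessing step, with the window size assumed, is given below.

```python
# Sketch of stacking neighboring ultrasound frames into one input vector,
# as a context window around each target frame. The window size is an assumption.
import numpy as np

def stack_neighbors(frames: np.ndarray, context: int = 2) -> np.ndarray:
    """frames: (T, H, W). Returns (T, (2*context+1)*H*W) stacked feature vectors,
    padding at the sequence edges by repeating the border frames."""
    T = frames.shape[0]
    flat = frames.reshape(T, -1)
    padded = np.concatenate([np.repeat(flat[:1], context, axis=0),
                             flat,
                             np.repeat(flat[-1:], context, axis=0)], axis=0)
    return np.concatenate([padded[i : i + T] for i in range(2 * context + 1)], axis=1)

features = stack_neighbors(np.random.rand(100, 64, 128), context=2)
print(features.shape)   # (100, 5 * 64 * 128)
```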
Eigentongue Feature Extraction for an Ultrasound-Based Silent Speech Interface
- Physics · 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07
- 2007
The article compares two approaches to the description of ultrasound vocal tract images for application in a "silent speech interface," one based on tongue contour modeling, and a second, global…
Autoencoder-Based Articulatory-to-Acoustic Mapping for Ultrasound Silent Speech Interfaces
- Computer Science · 2019 International Joint Conference on Neural Networks (IJCNN)
- 2019
The proposed method trains an autoencoder neural network on the ultrasound images and can use several consecutive frames during estimation without a significant increase in network size, while substantially improving the accuracy of parameter estimation.
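A rough sketch of the two-stage idea, with all sizes assumed: a frame-level autoencoder yields a small bottleneck code per ultrasound image, and the codes of several consecutive frames are concatenated for a much smaller regression network, so adding context frames grows only that second network's input.

```python
# Rough sketch (shapes and sizes assumed) of the two-stage idea: (1) an
# autoencoder compresses each ultrasound frame to a small bottleneck code,
# (2) the codes of several consecutive frames are concatenated and fed to a
# small regression network for the acoustic parameters.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

frame_dim, code_dim, n_context, n_targets = 64 * 128, 128, 5, 25

# Stage 1: frame-level autoencoder (trained with reconstruction loss).
inp = layers.Input(shape=(frame_dim,))
code = layers.Dense(code_dim, activation="relu", name="bottleneck")(inp)
recon = layers.Dense(frame_dim, activation="sigmoid")(code)
autoencoder = models.Model(inp, recon)
autoencoder.compile(optimizer="adam", loss="mse")
encoder = models.Model(inp, code)

# Stage 2: map concatenated codes of consecutive frames to acoustic parameters.
regressor = models.Sequential([
    layers.Input(shape=(n_context * code_dim,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(n_targets),                       # e.g. vocoder spectral parameters
])
regressor.compile(optimizer="adam", loss="mse")

frames = np.random.rand(10, frame_dim).astype("float32")
codes = encoder.predict(frames, verbose=0)                 # (10, code_dim)
window = codes[:n_context].reshape(1, -1)                  # codes of 5 consecutive frames
acoustic_params = regressor.predict(window, verbose=0)     # (1, n_targets)
```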
Multi-Task Learning of Speech Recognition and Speech Synthesis Parameters for Ultrasound-based Silent Speech Interfaces
- Computer Science · INTERSPEECH
- 2018
The results show that the parallel learning of the two types of targets is indeed beneficial for both tasks, and improvements are obtained by using multi-task training of deep neural networks as a weight initialization step before task-specific training.
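A hypothetical sketch of such multi-task training (sizes, targets, and loss weights assumed): a shared trunk on the ultrasound input feeds one head for recognition targets and one for synthesis parameters, trained with a weighted sum of the two losses.

```python
# Sketch of multi-task training (all sizes assumed): a shared trunk with a
# recognition head (e.g. phone classes) and a synthesis head (e.g. vocoder
# parameters), trained jointly with a weighted sum of the two losses.
import tensorflow as tf
from tensorflow.keras import layers, models

inp = layers.Input(shape=(64, 128, 1))
x = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)
x = layers.MaxPooling2D(2)(x)
x = layers.Flatten()(x)
shared = layers.Dense(256, activation="relu")(x)

phones = layers.Dense(40, activation="softmax", name="recognition")(shared)   # phone posteriors
vocoder = layers.Dense(25, name="synthesis")(shared)                          # spectral parameters

model = models.Model(inp, [phones, vocoder])
model.compile(
    optimizer="adam",
    loss={"recognition": "sparse_categorical_crossentropy", "synthesis": "mse"},
    loss_weights={"recognition": 0.5, "synthesis": 1.0},
)
```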
Restoring speech following total removal of the larynx by a learned transformation from sensor data to acoustics.
- Physics · The Journal of the Acoustical Society of America
- 2017
It is shown that it may be possible to restore speech by sensing movement of the remaining speech articulators and using machine learning algorithms to derive a transformation to convert this sensor data into an acoustic signal.
Speech synthesis from real time ultrasound images of the tongue
- Physics · 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing
- 2004
A machine learning technique is used to match reconstructed tongue contours in 30-frames-per-second ultrasound images to speaker vocal tract parameters obtained from a synchronized audio track. Speech…
Ultrasound-Based Silent Speech Interface Using Convolutional and Recurrent Neural Networks
- Computer Science · Acta Acustica united with Acustica
- 2019
A deep neural network based SSI, using ultrasound images of the tongue as input signals and the spectral coefficients of a vocoder as target parameters, is proposed and shown to produce the best objective and subjective results.
Improving deep neural networks for LVCSR using rectified linear units and dropout
- Computer Science · 2013 IEEE International Conference on Acoustics, Speech and Signal Processing
- 2013
Modelling deep neural networks with rectified linear unit (ReLU) non-linearities, with minimal human hyper-parameter tuning, on a 50-hour English Broadcast News task shows a 4.2% relative improvement over a DNN trained with sigmoid units and a 14.4% relative improvement over a strong GMM/HMM system.
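The recipe itself is simple to reproduce in outline; the sketch below shows a feed-forward DNN with ReLU hidden units and dropout between layers, with placeholder sizes rather than those of the Broadcast News system.

```python
# Minimal illustration of the ReLU + dropout recipe. Layer sizes, the input
# dimensionality, and the number of output states are placeholders.
import tensorflow as tf
from tensorflow.keras import layers, models

dnn = models.Sequential([
    layers.Input(shape=(440,)),                 # e.g. stacked acoustic feature frames
    layers.Dense(1024, activation="relu"),
    layers.Dropout(0.5),                        # dropout regularizes the large ReLU layers
    layers.Dense(1024, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(3000, activation="softmax"),   # context-dependent state posteriors
])
dnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```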
Formant Estimation and Tracking
- Physics
- 2008
The formant tracking task is formally described in detail, its successes and difficulties are explored, and the rationale behind the various approaches is discussed.
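One conventional formant-estimation approach (not necessarily the method surveyed in this reference) fits an LPC model to a windowed speech frame and reads formant candidates off the angles of the complex LPC poles; the snippet below sketches that, with the LPC order and thresholds assumed.

```python
# One conventional approach to formant estimation: fit an LPC model to a
# pre-emphasized, windowed speech frame and take the angles of the complex LPC
# roots as formant candidates. LPC order and thresholds are assumptions.
import numpy as np
import librosa

def lpc_formants(frame: np.ndarray, sr: int, order: int = 10) -> np.ndarray:
    """Return candidate formant frequencies (Hz) for one speech frame."""
    emphasized = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])   # pre-emphasis
    windowed = emphasized * np.hamming(len(emphasized))
    a = librosa.lpc(windowed, order=order)              # LPC polynomial coefficients
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]                   # keep one root per conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)          # pole angle -> frequency in Hz
    return np.sort(freqs[freqs > 90])                   # drop near-DC poles

frame = np.random.randn(400).astype(float)              # stand-in for a 25 ms frame at 16 kHz
print(lpc_formants(frame, sr=16000)[:3])                # rough F1-F3 candidates
```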