Silent versus modal multi-speaker speech recognition from ultrasound and video

@article{Ribeiro2021SilentVM,
  title={Silent versus modal multi-speaker speech recognition from ultrasound and video},
  author={Manuel Sam Ribeiro and Aciel Eshky and Korin Richmond and Steve Renals},
  journal={ArXiv},
  year={2021},
  volume={abs/2103.00333}
}
We investigate multi-speaker speech recognition from ultrasound images of the tongue and video images of the lips. We train our systems on imaging data from modal speech, and evaluate on matched test sets of two speaking modes: silent and modal speech. We observe that silent speech recognition from imaging data underperforms compared to modal speech recognition, likely due to a speaking-mode mismatch between training and testing. We improve silent speech recognition performance using techniques…
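
The evaluation protocol described in the abstract (train on modal-speech imaging data, then score matched modal and silent test sets with word error rate) can be illustrated with a toy sketch. Everything below is illustrative rather than the authors' code: the recogniser is a dummy stand-in and the transcripts are invented; only jiwer.wer is a real library call.

```python
# Toy sketch of matched-mode evaluation: one recogniser, two speaking modes.
# Assumption: a trained ultrasound+lip recogniser is replaced by a dummy here.
import jiwer

def dummy_recogniser(features):
    """Stand-in for a recogniser trained on modal-speech imaging data."""
    return "i owe you a yo yo"

# Hypothetical matched test sets: (features, reference transcript) pairs.
test_sets = {
    "modal":  [("ultrasound+lip features", "i owe you a yo yo")],
    "silent": [("ultrasound+lip features", "i owe you a yoyo today")],
}

for mode, utterances in test_sets.items():
    refs = [ref for _, ref in utterances]
    hyps = [dummy_recogniser(feats) for feats, _ in utterances]
    print(mode, "WER:", jiwer.wer(refs, hyps))
```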

Citations

Neural Speaker Embeddings for Ultrasound-based Silent Speech Interfaces
Presents multi-speaker experiments on the recently published TaL80 corpus, adapting the x-vector framework popular in speech processing to operate on ultrasound tongue videos, and finds that the embedding vectors generalize well to unseen speakers.
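
As a rough illustration of the x-vector idea mentioned above, adapted to image sequences: frame-level layers run along the time axis, statistics pooling collapses variable-length utterances, and a linear layer yields a fixed-size speaker embedding. This is a minimal sketch, not the cited paper's model; the frame size and layer widths are assumptions.

```python
# Minimal x-vector-style embedding network for ultrasound tongue video.
# Assumptions: frames are flattened to vectors; all sizes are illustrative only.
import torch
import torch.nn as nn

class UltrasoundXVector(nn.Module):
    def __init__(self, frame_dim=64 * 128, embed_dim=512):
        super().__init__()
        # Frame-level (TDNN-like) layers applied along the time axis.
        self.frame_net = nn.Sequential(
            nn.Conv1d(frame_dim, 512, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2, padding=2), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        # Segment-level layer after statistics pooling (mean + std -> 3000 dims).
        self.segment_net = nn.Linear(2 * 1500, embed_dim)

    def forward(self, frames):                # frames: (batch, time, frame_dim)
        x = self.frame_net(frames.transpose(1, 2))        # (batch, 1500, time)
        stats = torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)
        return self.segment_net(stats)                    # (batch, embed_dim)

# Example: two utterances of 50 flattened 64x128 ultrasound frames each.
video = torch.randn(2, 50, 64 * 128)
print(UltrasoundXVector()(video).shape)       # torch.Size([2, 512])
```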

References

Showing 1-10 of 33 references
Speaker-independent Classification of Phonetic Segments from Raw Ultrasound in Child Speech
This work investigates the classification of phonetic segments (tongue shapes) from raw ultrasound recordings under several training scenarios: speaker-dependent, multi-speaker, speaker-independent, and speaker-adapted, and observes that models underperform when applied to data from speakers not seen at training time.
Impact of lack of acoustic feedback in EMG-based silent speech recognition
This study compares EMG signals from the audible, whispered, and silent speaking modes to distinguish phonetic features such as consonants and vowels, and shows that the lack of acoustic feedback in silent speech implies an increased reliance on somatosensory feedback, which is visible in the EMG signal.
TaLNet: Voice Reconstruction from Tongue and Lip Articulation with Transfer Learning from Text-to-Speech Synthesis
Speech reconstructed using the proposed TaLNet method significantly outperformed all baselines (DNN, BLSTM and without transfer learning) in terms of both naturalness and intelligibility.
A Study on Robustness of Articulatory Features for Automatic Speech Recognition of Neutral and Whispered Speech
This study explores the robustness of articulatory features in ASR of neutral and whispered speech using acoustic, articulatory, and integrated acoustic-articulatory feature vectors in matched and mismatched train-test cases, and suggests that articulatory data contains information complementary to acoustic representations.
Comparison of DCT and autoencoder-based features for DNN-HMM multimodal silent speech recognition
Experimental results show that the two types of features achieve similar word error rates, but that the autoencoder features maintain good performance even for very low-dimensional feature vectors, demonstrating potential as a very compact representation of the information in multimodal silent speech data (a small sketch of DCT feature compaction appears after this reference list).
Tongue and Lip Motion Patterns in Voiced, Whispered, and Silent Vowel Production
The speech production process during vocalized speech involves the integration of auditory and vocal motor sensory feedback. The aim of this study was to evaluate articulatory behaviour in different…
Development of a silent speech interface driven by ultrasound and optical images of the tongue and lips
This work presents a segmental vocoder driven by ultrasound and optical images of the tongue and lips for a "silent speech interface" application, usable either by a laryngectomized patient or for silent communication.
Visuo-phonetic decoding using multi-stream and context-dependent models for an ultrasound-based silent speech interface
Improvements are presented for phonetic decoding of continuous speech from ultrasound and optical observations of the tongue and lips in a silent speech interface application; the visual streams are modeled by context-dependent multi-stream hidden Markov models (CD-MSHMM). A brief illustration of multi-stream score combination appears after this reference list.
TaL: A Synchronised Multi-Speaker Corpus of Ultrasound Tongue Imaging, Audio, and Lip Videos
The Tongue and Lips (TaL) corpus is described, and benchmark results are presented for the tasks of speech recognition, speech synthesis (articulatory-to-acoustic mapping), and automatic synchronisation of ultrasound to audio.
End-to-End Deep Learning Speech Recognition Model for Silent Speech Challenge
This work is the first attempt to apply an end-to-end, deep neural network-based automatic speech recognition (ASR) pipeline to the Silent Speech Challenge (SSC) dataset, which contains synchronized…
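
The DCT features compared against autoencoder features in the reference above are typically obtained by keeping only the low-order coefficients of a 2-D DCT of each ultrasound frame. The following is a minimal sketch under that assumption, using a synthetic frame rather than real ultrasound data; the frame size and number of retained coefficients are illustrative.

```python
# Minimal sketch of DCT-based feature compaction for a single ultrasound frame.
# Assumption: low-order (top-left) 2-D DCT coefficients form the feature vector.
import numpy as np
from scipy.fft import dctn

frame = np.random.rand(64, 128)          # stand-in for a raw ultrasound frame (H x W)
coeffs = dctn(frame, norm="ortho")       # full 2-D DCT of the frame
k = 5                                    # keep a k x k block of low-frequency coefficients
features = coeffs[:k, :k].flatten()      # very low-dimensional feature vector
print(features.shape)                    # (25,) instead of 8192 raw pixel values
```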
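
For the multi-stream HMMs (CD-MSHMM) cited above, a state's observation score is commonly formed by weighting the per-stream log-likelihoods with stream exponents. The numbers below are invented purely for illustration.

```python
# Sketch of multi-stream observation scoring: combine tongue (ultrasound) and
# lip (video) stream log-likelihoods with stream weights before decoding.
import numpy as np

log_lik_tongue = np.array([-4.2, -3.1, -5.0])   # per-state scores, ultrasound stream
log_lik_lips   = np.array([-3.8, -4.5, -2.9])   # per-state scores, lip-video stream
w_tongue, w_lips = 0.7, 0.3                     # illustrative stream weights

combined = w_tongue * log_lik_tongue + w_lips * log_lik_lips
print("combined scores:", combined, "best state:", int(np.argmax(combined)))
```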