An audio-visual corpus for speech perception and automatic speech recognition
Martin Cooke, Jon Barker, Stuart P. Cunningham, Xu Shao
The Journal of the Acoustical Society of America, vol. 120, no. 5 Pt 1

An audio-visual corpus has been collected to support the use of common material in speech perception and automatic speech recognition studies. The corpus consists of high-quality audio and video recordings of 1000 sentences spoken by each of 34 talkers. Sentences are simple, syntactically identical phrases such as "place green at B 4 now". Intelligibility tests using the audio signals suggest that the material is easily identifiable in quiet and low levels of stationary noise. The annotated… 
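The abstract describes sentences drawn from a fixed six-slot grammar (command, color, preposition, letter, digit, adverb). As a minimal sketch of that structure, the snippet below enumerates sentences of this form; the exact word inventories shown are an assumption based on the example "place green at B 4 now" (the corpus paper itself defines the full word sets).

```python
import itertools

# Assumed GRID-style word categories; the published corpus specifies
# the definitive inventories (e.g. which letters are included).
commands = ["bin", "lay", "place", "set"]
colors = ["blue", "green", "red", "white"]
prepositions = ["at", "by", "in", "with"]
letters = list("ABCDEFGHIJKLMNOPQRSTUVXYZ")   # 25 letters, "W" omitted
digits = [str(d) for d in range(10)]
adverbs = ["again", "now", "please", "soon"]

def grid_sentences():
    """Yield every syntactically identical six-word sentence."""
    for parts in itertools.product(commands, colors, prepositions,
                                   letters, digits, adverbs):
        yield " ".join(parts)

sentences = list(grid_sentences())
print(len(sentences))   # 4 * 4 * 4 * 25 * 10 * 4 = 64000
print(sentences[0])     # "bin blue at A 0 again"
```

Each of the 34 talkers recorded 1000 such sentences, i.e. a subset of the full combinatorial space enumerated above.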


A Romanian corpus for speech perception and automatic speech recognition
A speech corpus in Romanian is made available as common material for speech perception and automatic speech recognition studies; preliminary intelligibility tests suggest that the collected speech is easily identifiable in quiet and at low levels of noise.
Audio-visual speech recognition in the presence of non-stationary noise
An approach to robust automatic speech recognition is extended that couples source separation and speech recognition by 'piecing together' spectro-temporal fragments of speech recovered from regions of a time-frequency representation in which the signal locally dominates the noise.
TCD-TIMIT: An Audio-Visual Corpus of Continuous Speech
The creation of a new corpus designed for continuous audio-visual speech recognition research, TCD-TIMIT, which consists of high-quality audio and video footage of 62 speakers reading a total of 6913 phonetically rich sentences is detailed.
An audio-visual corpus for multimodal automatic speech recognition
Results achieved with the developed audio-visual automatic speech recognition (ASR) engine trained and tested with the material contained in the corpus are presented and discussed together with comparative test results employing a state-of-the-art/commercial ASR engine.
Continuous visual speech recognition for audio speech enhancement
A novel non-blind speech enhancement procedure based on visual speech recognition (VSR) uses a generative process that analyzes sequences of talking faces and classifies them into visual speech units known as visemes; it clearly outperforms baseline blind methods as well as related work.
Audio-Visual Speech Recognition and Synthesis
A pipeline for recognition-free retrieval is developed and its performance is compared against recognition-based retrieval on a large-scale dataset and on a set of out-of-vocabulary words; a user study validates the claim of a better viewing experience compared to baseline methods.
Towards Robust Audio-Visual Speech Recognition
To achieve a speaker-independent visual speech recognizer, this thesis proposes to employ a pool of scale-invariant feature transform (SIFT) coefficients extracted from multiple color spaces.
Audio-visual speech fragment decoding
This paper presents a robust speech recognition technique called audio-visual speech fragment decoding (AV-SFD), in which the visual signal is exploited both as a cue for source separation and as a
Audio-visual speaker separation
Experimental results are presented that compare the proposed audio-visual speaker separation with the audio-only method using both speech quality and intelligibility metrics.
Generating Intelligible Audio Speech From Visual Speech
  • T. L. Cornu, B. Milner
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2017
This paper is concerned with generating intelligible audio speech from a video of a person talking. Regression and classification methods are proposed first to estimate static spectral envelope


Recent advances in the automatic recognition of audiovisual speech
The main components of audiovisual automatic speech recognition (ASR) are reviewed and novel contributions in two main areas are presented: first, the visual front-end design, based on a cascade of linear image transforms of an appropriate video region of interest, and subsequently, audiovisual speech integration.
SWITCHBOARD: telephone speech corpus for research and development
  • J. Godfrey, E. Holliman, J. McDaniel
  • Physics, Linguistics
    [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing
  • 1992
SWITCHBOARD is a large multispeaker corpus of conversational speech and text which should be of interest to researchers in speaker authentication and large vocabulary speech recognition. About 2500
A database for speaker-independent digit recognition
A large speech database has been collected for use in designing and evaluating algorithms for speaker independent recognition of connected digit sequences and formal human listening tests on this database provided certification of the labelling of the digit sequences.
Recognition of plosive syllables in noise: comparison of an auditory model with human performance.
An auditory model was developed that incorporated response threshold shifts; it was interfaced to a hidden Markov model recognizer and tested with the same sounds employed in the human perception experiments, where recognition scores were greater with the threshold shifts than without them.
A glimpsing model of speech perception in noise.
  • M. Cooke
  • Physics
    The Journal of the Acoustical Society of America
  • 2006
An automatic speech recognition system, adapted for use with partially specified inputs and used to identify consonants in noise, revealed that cues to voicing are degraded more in the model than in human auditory processing.
A physical method for measuring speech-transmission quality.
The resulting index, the Speech-Transmission Index (STI), has been correlated with subjective intelligibility scores obtained on 167 different transmission channels with a wide variety of disturbances and the relative predictive power of the STI appeared to be 5%.
Informational and energetic masking effects in the perception of multiple simultaneous talkers.
The results of these experiments demonstrate how monaural factors may play an important role in the segregation of speech signals in multitalker environments.
Adequacy of auditory models to predict human internal representation of speech sounds.
  • O. Ghitza
  • Physics
    The Journal of the Acoustical Society of America
  • 1993
A diagnostic system has been developed that simulates the psychophysical procedure used in the standard Diagnostic-Rhyme Test (DRT) and provides detailed diagnostics that show the error distributions among six phonetically distinctive features.
A speech corpus for multitalker communications research.
A database of speech samples from eight different talkers has been collected for use in multitalker communications research. Descriptions of the nature of the corpus, the data collection methodology,
Speech intelligibility prediction in hearing-impaired listeners based on a psychoacoustically motivated perception model.
The underlying model is a first step toward a quantitative understanding of speech intelligibility and helps to distinguish between the influence of the "attenuation" and the "distortion" component of the hearing loss.