Recurrent Neural Network Transducer for Audio-Visual Speech Recognition

  • T. Makino, H. Liao, Yannis M. Assael, Brendan Shillingford, Basi García, Otavio Braga, O. Siohan
  • Published 2019
  • Computer Science, Engineering
  • 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (A/V) dataset of segmented utterances extracted from public YouTube videos, yielding 31k hours of audio-visual training content. The performance of audio-only, visual-only, and audio-visual systems is compared on two large-vocabulary test sets: a set of utterance segments from…
9 Citations
AV Taris: Online Audio-Visual Speech Recognition
End-to-end Audio-visual Speech Recognition with Conformers
  • Highly Influenced
Audio-visual Multi-channel Integration and Recognition of Overlapped Speech
Audio-visual Multi-channel Recognition of Overlapped Speech
  • 6
End-to-End Multi-Person Audio/Visual Automatic Speech Recognition
Automatic speech recognition: a survey
ASR is All You Need: Cross-Modal Distillation for Lip Reading
  • 9
  • Highly Influenced
Now You're Speaking My Language: Visual Language Identification
M3F: Multi-Modal Continuous Valence-Arousal Estimation in the Wild
  • 7

References
Deep Audio-Visual Speech Recognition
  • 157
  • Highly Influential
Large-Scale Visual Speech Recognition
  • 44
Robust Audio-visual Speech Recognition Using Bimodal Dfsmn with Multi-condition Training and Dropout Regularization
  • S. Zhang, Ming Lei, B. Ma, Lei Xie
  • Computer Science
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
  • 11
An audio-visual corpus for multimodal automatic speech recognition
  • 41
Audio-Visual Speech Recognition with a Hybrid CTC/Attention Architecture
  • 29
Looking to listen at the cocktail party
  • 277
Recent advances in the automatic recognition of audiovisual speech
  • 717
A Comparison of Sequence-to-Sequence Models for Speech Recognition
  • 157
Large scale deep neural network acoustic modeling with semi-supervised training data for YouTube video transcription
  • 135
The Conversation: Deep Audio-Visual Speech Enhancement
  • 133