End-to-End Audiovisual Fusion with LSTMs

@inproceedings{Petridis2017EndtoEndAF,
  title={End-to-End Audiovisual Fusion with LSTMs},
  author={Stavros Petridis and Yujiang Wang and Zuwei Li and Maja Pantic},
  booktitle={AVSP},
  year={2017}
}
Several end-to-end deep learning approaches have been recently presented which simultaneously extract visual features from the input images and perform visual speech classification. However, research on jointly extracting audio and visual features and performing classification is very limited. In this work, we present an end-to-end audiovisual model based on Bidirectional Long Short-Term Memory (BLSTM) networks. To the best of our knowledge, this is the first audiovisual fusion model which… 
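
The abstract describes jointly learning features from image pixels and raw audio waveforms with BLSTMs. Below is a minimal PyTorch-style sketch of that kind of architecture, not the authors' exact model; layer sizes, the 96x96 mouth ROI, the 640 audio samples per frame, and all module names are illustrative assumptions. A small CNN encodes mouth-region images, a 1D CNN encodes raw waveform chunks, per-modality BLSTMs model temporal dynamics, and a fusion BLSTM combines the two streams before classification.

```python
# Illustrative sketch of an end-to-end audiovisual BLSTM fusion model
# (architecture details are assumptions, not the paper's exact configuration).
import torch
import torch.nn as nn

class AudiovisualBLSTM(nn.Module):
    def __init__(self, num_classes=10, hidden=256):
        super().__init__()
        # Visual stream: per-frame CNN on grayscale mouth ROIs.
        self.visual_cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),           # -> (B*T, 64)
        )
        # Audio stream: 1D CNN directly on raw waveform chunks per video frame.
        self.audio_cnn = nn.Sequential(
            nn.Conv1d(1, 32, 80, stride=4), nn.ReLU(),
            nn.Conv1d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),           # -> (B*T, 64)
        )
        # Per-modality BLSTMs, then a fusion BLSTM over the concatenated streams.
        self.visual_blstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.audio_blstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.fusion_blstm = nn.LSTM(4 * hidden, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, frames, audio):
        # frames: (B, T, 1, H, W) mouth ROIs; audio: (B, T, S) raw samples per frame.
        B, T = frames.shape[:2]
        v = self.visual_cnn(frames.flatten(0, 1)).view(B, T, -1)
        a = self.audio_cnn(audio.flatten(0, 1).unsqueeze(1)).view(B, T, -1)
        v, _ = self.visual_blstm(v)
        a, _ = self.audio_blstm(a)
        fused, _ = self.fusion_blstm(torch.cat([v, a], dim=-1))
        return self.classifier(fused[:, -1])                 # utterance-level prediction

model = AudiovisualBLSTM()
logits = model(torch.randn(2, 29, 1, 96, 96), torch.randn(2, 29, 640))  # -> (2, 10)
```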

Citations

End-to-End Audiovisual Speech Recognition

TLDR
This is the first audiovisual fusion model which simultaneously learns to extract features directly from the image pixels and audio waveforms and performs within-context word recognition on a large publicly available dataset (LRW).

End-To-End Audiovisual Feature Fusion for Active Speaker Detection

TLDR
This work presents a novel two-stream end-to-end framework that fuses features extracted from images via VGG-M with raw Mel-Frequency Cepstral Coefficient (MFCC) features extracted from the audio waveform, and indicates that this feature extraction strategy is more robust to noisy signals and offers better inference time than models that employ ConvNets on both modalities.
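
As a small illustration of the audio side of such a two-stream setup, the sketch below extracts MFCCs from a raw waveform, aligns them to video frames, and concatenates them with per-frame visual embeddings. The sample rate, frame rate, number of coefficients, and the visual feature dimension are assumptions, not values from the paper.

```python
# MFCCs per video frame, then simple concatenation with visual features.
import numpy as np
import librosa

def mfcc_per_video_frame(waveform, sr=16000, fps=25, n_mfcc=13):
    """Return one MFCC vector per video frame by matching the hop length to the frame rate."""
    hop = sr // fps                                    # audio samples per video frame
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
    return mfcc.T                                      # (num_frames, n_mfcc)

# Toy usage with random data and hypothetical per-frame visual embeddings.
audio = np.random.randn(16000).astype(np.float32)      # 1 s of audio
visual_feats = np.random.randn(25, 64)                 # e.g. VGG-M-style embeddings, 25 fps
audio_feats = mfcc_per_video_frame(audio)[:25]
fused = np.concatenate([visual_feats, audio_feats], axis=1)   # (25, 77)
```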

Robust Audio-Visual Speech Recognition Based on Hybrid Fusion

TLDR
This work proposes a novel hybrid-fusion-based AVSR method with residual networks and Bidirectional Gated Recurrent Units (BGRUs) that is able to distinguish homophones in both clean and noisy conditions, and introduces a combined loss that improves noise robustness when learning the joint representation across modalities.

Audiovisual speech recognition: A review and forecast

TLDR
It is argued that end-to-end audiovisual speech recognition models and deep learning-based feature extractors will guide multimodal human–computer interaction directly toward a solution.

End-to-End Sentence-Level Multi-View Lipreading Architecture with Spatial Attention Module Integrated Multiple CNNs and Cascaded Local Self-Attention-CTC

TLDR
The proposed end-to-end sentence-level multi-view VSR architecture for faces captured from four different perspectives uses multiple convolutional neural networks with a spatial attention module to detect minor changes in the mouth patterns of similarly pronounced words, boosting its usefulness in real-world applications.

Audio-Visual Transformer Based Crowd Counting

TLDR
A new audiovisual multi-task network is proposed to address the critical challenges in crowd counting by effectively utilizing both visual and audio inputs for better modality association and more productive feature extraction.

WaveNet With Cross-Attention for Audiovisual Speech Recognition

TLDR
WaveNet is extended to audiovisual speech recognition, and a cross-attention mechanism is introduced at different points in WaveNet for feature fusion, addressing the multimodal feature fusion and frame alignment problems between the two data streams.
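
The sketch below shows generic cross-attention fusion of two feature streams, not the paper's specific WaveNet integration: audio features act as queries attending over visual features, with a residual connection and layer normalization. Dimensions and module names are assumptions.

```python
# Generic cross-attention fusion of audio and visual feature sequences.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (B, Ta, dim), visual_feats: (B, Tv, dim)
        attended, _ = self.attn(query=audio_feats, key=visual_feats, value=visual_feats)
        return self.norm(audio_feats + attended)        # residual + layer norm

fusion = CrossAttentionFusion()
out = fusion(torch.randn(2, 100, 256), torch.randn(2, 29, 256))   # -> (2, 100, 256)
```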

References

SHOWING 1-10 OF 31 REFERENCES

End-to-end visual speech recognition with LSTMs

TLDR
This work presents an end-to-end visual speech recognition system based on Long Short-Term Memory (LSTM) networks, which is the first model that simultaneously learns to extract features directly from the pixels and perform classification, and also achieves state-of-the-art performance in visual speech classification.

Deep multimodal learning for Audio-Visual Speech Recognition

TLDR
This work studies an approach where uni-modal deep networks are trained separately and their final hidden layers are fused to obtain a joint feature space in which another deep network is built, demonstrating the tremendous value of the visual channel for phone classification even in audio with a high signal-to-noise ratio.
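
A minimal sketch of that fusion strategy follows, under assumed feature dimensions and network shapes: two uni-modal networks (assumed already trained) produce their final hidden activations, which are concatenated into a joint feature space on which a separate joint network is trained.

```python
# Late fusion of uni-modal hidden layers into a joint feature space (sizes assumed).
import torch
import torch.nn as nn

audio_net  = nn.Sequential(nn.Linear(120, 512), nn.ReLU(), nn.Linear(512, 256), nn.ReLU())
visual_net = nn.Sequential(nn.Linear(64, 512), nn.ReLU(), nn.Linear(512, 256), nn.ReLU())
joint_net  = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 39))  # e.g. phone classes

def fuse_and_classify(audio_x, visual_x):
    with torch.no_grad():                  # uni-modal nets are treated as fixed feature extractors
        h = torch.cat([audio_net(audio_x), visual_net(visual_x)], dim=-1)
    return joint_net(h)                    # only the joint network receives gradients here

logits = fuse_and_classify(torch.randn(8, 120), torch.randn(8, 64))   # -> (8, 39)
```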

Temporal Multimodal Learning in Audiovisual Speech Recognition

TLDR
A novel temporal multimodal deep learning architecture, named Recurrent Temporal Multimodal RBM (RTMRBM), models multimodal sequences by transforming a sequence of connected multimodal RBMs (MRBMs) into a probabilistic series model, and clearly improves recognition accuracy compared with the standard MRBM and a temporal model based on conditional RBMs.

Prediction-Based Audiovisual Fusion for Classification of Non-Linguistic Vocalisations

TLDR
This work trains predictive models which capture the spatiotemporal relationship between audio and visual features by learning the audio-to-visual and visual-to-audio feature mapping for each class, and performs cross-database experiments using the AMI, SAL, and MAHNOB databases in order to classify laughter.
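
The sketch below illustrates the prediction-based idea in a deliberately simplified, one-directional form (audio-to-visual only, with ridge regression as the mapping); the actual method also learns the reverse mapping and a specific error-based fusion. All names and data here are hypothetical.

```python
# Prediction-based classification: pick the class whose audio-to-visual
# regressor best predicts the observed visual features.
import numpy as np
from sklearn.linear_model import Ridge

def train_class_mappings(data_per_class):
    """data_per_class: {label: (audio_feats, visual_feats)} with time-aligned rows."""
    return {c: Ridge().fit(a, v) for c, (a, v) in data_per_class.items()}

def classify(models, audio_feats, visual_feats):
    errors = {c: np.mean((m.predict(audio_feats) - visual_feats) ** 2)
              for c, m in models.items()}
    return min(errors, key=errors.get)     # class with the lowest prediction error

# Toy usage with random features for two classes.
rng = np.random.default_rng(0)
data = {c: (rng.normal(size=(50, 13)), rng.normal(size=(50, 20))) for c in ("laughter", "speech")}
models = train_class_mappings(data)
print(classify(models, rng.normal(size=(10, 13)), rng.normal(size=(10, 20))))
```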

Integration of deep bottleneck features for audio-visual speech recognition

TLDR
This paper proposes a method of integrating DBNFs using multi-stream HMMs in order to improve the performance of AVSR under both clean and noisy conditions, and evaluates the method on a continuously spoken Japanese digit recognition task under matched and mismatched conditions.
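
A common way to combine streams in a multi-stream HMM is to weight the per-stream log-likelihoods by stream exponents and sum them; the tiny sketch below shows only that combination step (weights and decoder are assumptions, not the paper's settings).

```python
# Stream-weighted log-likelihood combination for a two-stream HMM.
import numpy as np

def combine_streams(log_lik_audio, log_lik_visual, weight_audio=0.7):
    """Arrays of shape (T, num_states); lower weight_audio under acoustic noise."""
    return weight_audio * log_lik_audio + (1.0 - weight_audio) * log_lik_visual

fused = combine_streams(np.log(np.full((5, 3), 0.2)), np.log(np.full((5, 3), 0.1)))
```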

LipNet: End-to-End Sentence-level Lipreading

TLDR
This work presents LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end.
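
The rough sketch below follows the same recipe in spirit: spatiotemporal 3D convolutions over the video, a bidirectional GRU, and CTC loss over a character vocabulary. Layer sizes, vocabulary size, and sequence lengths are assumptions, not LipNet's published configuration.

```python
# Spatiotemporal convolutions + bidirectional GRU + CTC loss (sizes assumed).
import torch
import torch.nn as nn

class LipReader(nn.Module):
    def __init__(self, vocab_size=28):                   # characters + CTC blank
        super().__init__()
        self.stcnn = nn.Sequential(
            nn.Conv3d(3, 32, (3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)), nn.ReLU(),
            nn.Conv3d(32, 64, (3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),           # pool space, keep the time axis
        )
        self.gru = nn.GRU(64, 128, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(256, vocab_size)
        self.ctc = nn.CTCLoss(blank=0)

    def forward(self, video):                             # video: (B, 3, T, H, W)
        x = self.stcnn(video).squeeze(-1).squeeze(-1)      # (B, 64, T)
        x, _ = self.gru(x.transpose(1, 2))                 # (B, T, 256)
        return self.fc(x).log_softmax(-1)                  # (B, T, vocab)

model = LipReader()
log_probs = model(torch.randn(2, 3, 75, 50, 100))          # 75 video frames
targets = torch.randint(1, 28, (2, 20))
loss = model.ctc(log_probs.permute(1, 0, 2),               # CTC expects (T, B, C)
                 targets,
                 input_lengths=torch.full((2,), 75, dtype=torch.long),
                 target_lengths=torch.full((2,), 20, dtype=torch.long))
```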

Multimodal Deep Learning

TLDR
This work presents a series of tasks for multimodal learning and shows how to train deep networks that learn features to address these tasks, and demonstrates cross-modality feature learning, where better features for one modality can be learned if multiple modalities are present at feature learning time.
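
One standard way to realize cross-modality feature learning is a bimodal autoencoder whose shared code must reconstruct both modalities even when one input is missing; the sketch below shows that idea with assumed layer sizes, not the paper's exact architecture or training schedule.

```python
# Bimodal autoencoder sketch for cross-modality feature learning (sizes assumed).
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    def __init__(self, audio_dim=100, video_dim=64, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(audio_dim + video_dim, 128), nn.ReLU(),
                                     nn.Linear(128, code_dim))
        self.decode_audio = nn.Linear(code_dim, audio_dim)
        self.decode_video = nn.Linear(code_dim, video_dim)

    def forward(self, audio, video):
        code = self.encoder(torch.cat([audio, video], dim=-1))
        return self.decode_audio(code), self.decode_video(code)

model = BimodalAutoencoder()
a, v = torch.randn(8, 100), torch.randn(8, 64)
# Zeroing one input forces the code learned from video alone to still
# reconstruct the audio (and vice versa), encouraging shared structure.
rec_a, rec_v = model(torch.zeros_like(a), v)
loss = nn.functional.mse_loss(rec_a, a) + nn.functional.mse_loss(rec_v, v)
```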

Audio-Visual Speech Modeling for Continuous Speech Recognition

TLDR
A speech recognition system is presented that uses both acoustic and visual speech information to improve recognition performance in noisy environments, and is demonstrated on a large multispeaker database of continuously spoken digits.

Recent advances in the automatic recognition of audiovisual speech

TLDR
The main components of audiovisual automatic speech recognition (ASR) are reviewed and novel contributions in two main areas are presented: first, the visual front-end design, based on a cascade of linear image transforms of an appropriate video region of interest, and subsequently, audiovisual speech integration.

Audio-Visual Speech Recognition Using Bimodal-Trained Bottleneck Features for a Person with Severe Hearing Loss

TLDR
A novel visual feature extraction approach that efficiently connects the lip image to audio features is proposed, and the use of convolutive bottleneck networks (CBNs) increases robustness to speech fluctuations caused by hearing loss.