Mutual Information Maximization for Effective Lip Reading

  title={Mutual Information Maximization for Effective Lip Reading},
  author={Xingyuan Zhao and Shuang Yang and S. Shan and Xilin Chen},
  journal={2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020)},
  • Published 13 March 2020
  • Computer Science
Lip reading has received increasing research interest in recent years due to the rapid development of deep learning and its widespread potential applications. Good performance on the lip reading task depends heavily on how effectively the representation captures lip movement information while resisting the noise caused by changes in pose, lighting conditions, speaker appearance, speaking speed and so on. Towards this target, we propose… 
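The abstract is truncated before the proposed objective is stated, so as hedged background only (not the authors' exact formulation): a common way to maximize mutual information between a learned representation and its target is a contrastive, InfoNCE-style lower bound. The `infonce_lower_bound` helper below is a hypothetical minimal sketch of that idea.

```python
import math

def infonce_lower_bound(scores):
    """Contrastive (InfoNCE-style) lower bound on mutual information.

    scores[i][j] is a similarity score between sample i of one view
    (e.g. a lip-motion representation) and sample j of the other view
    (e.g. a word embedding); the diagonal holds the positive pairs.
    Returns the average log-softmax of the positives plus log N,
    which lower-bounds I(X; Y) and is itself capped at log N.
    """
    n = len(scores)
    total = 0.0
    for i in range(n):
        log_denom = math.log(sum(math.exp(s) for s in scores[i]))
        total += scores[i][i] - log_denom  # log-softmax of the positive pair
    return total / n + math.log(n)

# Toy example: positives score well above negatives, so the bound
# approaches its ceiling of log(3) ~= 1.0986.
scores = [[5.0, 0.0, 0.0],
          [0.0, 5.0, 0.0],
          [0.0, 0.0, 5.0]]
print(infonce_lower_bound(scores))
```

Maximizing this bound by gradient ascent on the scoring function pushes matched pairs together and mismatched pairs apart, which is one standard route to the representation robustness the abstract motivates.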


Learn an Effective Lip Reading Model without Pains

A comprehensive quantitative study and comparative analysis are performed, for the first time, to show the effects of several different design choices for lip reading, finding that proper use of these strategies consistently brings notable improvements without changing much of the model.

Learning the Relative Dynamic Features for Word-Level Lipreading

An efficient two-stream model is proposed to learn the relative dynamic information of lip motion; evaluated on two large-scale lipreading datasets, it achieves a new state of the art.

Advances and Challenges in Deep Lip Reading

A comprehensive survey of the state-of-the-art deep learning based VSR research with a focus on data challenges, task-specific complications, and the corresponding solutions.

Boosting Lip Reading with a Multi-View Fusion Network

A Multi-View Fusion Network (MVFN) is proposed, which can extract more discriminative visual representations by incorporating appearance and shape information of the lip region and achieves state-of-the-art performance.

Audio-Driven Deformation Flow for Effective Lip Reading

The results show that the proposed encoder-decoder architecture not only improves the lip reading model's performance without extra computation cost at test time, but also achieves higher performance than distilling directly from the ASR model, which demonstrates the advantages of the proposed deformation-flow-based method.

Tibetan lip reading based on D3D

The D3D algorithm is applied to Tibetan lip reading, and the feature extractor is improved by replacing the spatial convolutions in DenseNet with spatio-temporal convolutions, enhancing the model's ability to capture short-term temporal dependencies.

Spatio-Temporal Attention Mechanism and Knowledge Distillation for Lip Reading

A new lip-reading model combines three contributions: a spatio-temporal attention mechanism that helps extract informative features from the input visual frames, and sequence-level and frame-level knowledge distillation techniques that leverage audio data during visual model training.

LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers

This work develops a sentence-level lipreading framework based on visual-landmark transformers, namely LipFormer, which can effectively enhance the model generalization to unseen speakers.

Chinese Mandarin Lipreading using Cascaded Transformers with Multiple Intermediate Representations

A cascaded Transformer-based model with a new cross-level attention mechanism is proposed, enriching information transmission between the cascaded stages and reducing error accumulation.

Lip-Listening: Mixing Senses to Understand Lips using Cross Modality Knowledge Distillation for Word-Based Models

This work proposes a novel and competitive architecture for lip reading, demonstrating a noticeable performance improvement and setting a new benchmark of 88.64% on the LRW dataset.

Multi-Grained Spatio-temporal Modeling for Lip-reading

A novel lip-reading model which captures not only the nuance between words but also styles of different speakers, by a multi-grained spatio-temporal modeling of the speaking process is proposed.

LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild

This paper presents a naturally-distributed large-scale benchmark for lip reading in the wild, named LRW-1000, which contains 1,000 classes with 718,018 samples from more than 2,000 individual speakers; it is currently the largest word-level lip reading dataset and the only public large-scale Mandarin lip reading dataset.

LCANet: End-to-End Lipreading with Cascaded Attention-CTC

LCANet, an end-to-end deep neural network based lipreading system, is proposed; it incorporates a cascaded attention-CTC decoder to generate output texts and achieves notable performance improvements as well as faster convergence.

Lip Reading Sentences in the Wild

The WLAS model trained on the LRS dataset surpasses the performance of all previous work on standard lip reading benchmark datasets, often by a significant margin, and it is demonstrated that if audio is available, then visual information helps to improve speech recognition performance.

Learning to lip read words by watching videos

High-Resolution Talking Face Generation via Mutual Information Approximation

A novel high-resolution talking face generation model for arbitrary persons is proposed, discovering cross-modality coherence via Mutual Information Approximation (MIA) under the assumption that the modality gap between audio and video is larger than that between real and generated video.

LipNet: End-to-End Sentence-level Lipreading

This work presents LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end.

A review of recent advances in visual speech decoding

Pushing the boundaries of audiovisual word recognition using Residual Networks and LSTMs

Toward movement-invariant automatic lip-reading and speech recognition

We present the development of a modular system for flexible human-computer interaction via speech. The speech recognition component integrates acoustic and visual information (automatic lip-reading).