Mutual Information Maximization for Effective Lip Reading
@article{Zhao2020MutualIM, title={Mutual Information Maximization for Effective Lip Reading}, author={Xingyuan Zhao and Shuang Yang and S. Shan and Xilin Chen}, journal={2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020)}, year={2020}, pages={420-427} }
Lip reading has received increasing research interest in recent years, owing to the rapid development of deep learning and its wide range of potential applications. A key factor in obtaining good lip reading performance is how effectively the learned representation captures lip movement information while resisting the noise introduced by changes in pose, lighting conditions, speaker appearance, speaking speed, and so on. Towards this target, we propose…
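The abstract is truncated above, but the title points to mutual information maximization as the central idea. As a purely illustrative aid (not the authors' actual objective, which is not spelled out here), the sketch below shows one common way to maximize mutual information between local and global representations via an InfoNCE-style contrastive lower bound; the function name, tensor shapes, and the weighting `lambda_mi` are assumptions.

```python
# Hypothetical sketch of mutual-information maximization via an InfoNCE-style
# lower bound; it only illustrates the general idea of tying local (frame-level)
# features to a global (sequence-level) representation so that the encoder
# focuses on lip movement rather than nuisance factors.
import torch
import torch.nn.functional as F

def info_nce_mi(frame_feats: torch.Tensor, global_feats: torch.Tensor,
                temperature: float = 0.1) -> torch.Tensor:
    """frame_feats: (B, D) local features; global_feats: (B, D) global features.
    Pairs sharing a batch index are positives; all other pairs are negatives."""
    f = F.normalize(frame_feats, dim=-1)
    g = F.normalize(global_feats, dim=-1)
    logits = f @ g.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(f.size(0), device=f.device)
    # Minimizing this cross-entropy maximizes a lower bound on I(f; g).
    return F.cross_entropy(logits, targets)

# Hypothetical usage: combined with the usual word-classification loss.
# loss = ce_loss + lambda_mi * info_nce_mi(local_repr, global_repr)
```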
29 Citations
Learn an Effective Lip Reading Model without Pains
- Computer Science, ArXiv
- 2020
A comprehensive quantitative study and comparative analysis is performed, for the first time, to show the effects of several different design choices for lip reading; it finds that proper use of these strategies consistently brings improvements without changing much of the model.
Learning the Relative Dynamic Features for Word-Level Lipreading
- Computer Science, Sensors
- 2022
An efficient two-stream model is proposed to learn the relative dynamic information of lip motion; evaluated on two large-scale lipreading datasets, it achieves a new state of the art.
Advances and Challenges in Deep Lip Reading
- Computer Science, ArXiv
- 2021
A comprehensive survey of the state-of-the-art deep learning based VSR research with a focus on data challenges, task-specific complications, and the corresponding solutions.
Boosting Lip Reading with a Multi-View Fusion Network
- Computer Science, 2022 IEEE International Conference on Multimedia and Expo (ICME)
- 2022
A Multi-View Fusion Network (MVFN) is proposed, which can extract more discriminative visual representations by incorporating appearance and shape information of the lip region and achieves state-of-the-art performance.
Audio-Driven Deformation Flow for Effective Lip Reading
- Computer Science, 2022 26th International Conference on Pattern Recognition (ICPR)
- 2022
The results show that the proposed encoder-decoder architecture not only improves the lip reading model’s performance without extra computation cost at test time, but also achieves higher performance than distilling from the ASR model directly, which demonstrates the advantages of the proposed deformation-flow-based method.
Tibetan lip reading based on D3D
- Computer Science, 2021 2nd International Conference on Big Data & Artificial Intelligence & Software Engineering (ICBASE)
- 2021
The D3D algorithm is applied to Tibetan lip reading; it improves the feature extractor by replacing the spatial convolutions in DenseNet with spatio-temporal convolutions, which enhances the model’s ability to capture short-term temporal dependencies.
Spatio-Temporal Attention Mechanism and Knowledge Distillation for Lip Reading
- Computer Science, IJCCI
- 2021
A new lip-reading model combining three contributions is proposed: a spatio-temporal attention mechanism that helps extract informative data from the input visual frames, and sequence-level and frame-level knowledge distillation techniques that allow leveraging audio data during visual model training.
LipFormer: Learning to Lipread Unseen Speakers based on Visual-Landmark Transformers
- Computer Science, ArXiv
- 2023
This work develops a sentence-level lipreading framework based on visual-landmark transformers, namely LipFormer, which can effectively enhance the model generalization to unseen speakers.
Chinese Mandarin Lipreading using Cascaded Transformers with Multiple Intermediate Representations
- Computer Science, 2022 IEEE International Conference on Image Processing (ICIP)
- 2022
A cascaded Transformer-based model with a new cross-level attention mechanism is proposed, enriching the ways of information transmission between cascading structures and reducing the accumulation of errors.
Lip-Listening: Mixing Senses to Understand Lips using Cross Modality Knowledge Distillation for Word-Based Models
- Computer Science, ArXiv
- 2022
This work proposes a novel and competitive architecture for lip-reading, demonstrating a noticeable improvement in performance and setting a new benchmark of 88.64% on the LRW dataset.
References
Multi-Grained Spatio-temporal Modeling for Lip-reading
- Computer Science, BMVC
- 2019
A novel lip-reading model which captures not only the nuance between words but also styles of different speakers, by a multi-grained spatio-temporal modeling of the speaking process is proposed.
LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild
- Computer Science, 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019)
- 2019
This paper presents a naturally-distributed large-scale benchmark for lip-reading in the wild, named LRW-1000, which contains 1,000 classes with 718,018 samples from more than 2,000 individual speakers; it is currently the largest word-level lip-reading dataset and the only public large-scale Mandarin lip-reading dataset.
LCANet: End-to-End Lipreading with Cascaded Attention-CTC
- Computer Science, 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018)
- 2018
LCANet is proposed, an end-to-end deep neural network based lipreading system that incorporates a cascaded attention-CTC decoder to generate output texts; it achieves notable performance improvements as well as faster convergence.
Lip Reading Sentences in the Wild
- Computer Science, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
The WLAS model trained on the LRS dataset surpasses the performance of all previous work on standard lip reading benchmark datasets, often by a significant margin, and it is demonstrated that if audio is available, then visual information helps to improve speech recognition performance.
High-Resolution Talking Face Generation via Mutual Information Approximation
- Computer Science, ArXiv
- 2018
A novel high-resolution talking face generation model for an arbitrary person is proposed by discovering the cross-modality coherence via Mutual Information Approximation (MIA), under the assumption that the modality difference between audio and video is larger than that between real video and generated video.
LipNet: End-to-End Sentence-level Lipreading
- Computer Science
- 2016
This work presents LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end.
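For readers unfamiliar with this kind of pipeline, the following is a rough, hypothetical sketch of the components the summary names (spatiotemporal convolutions, a recurrent network, and a CTC output layer); the layer sizes and pooling choices are illustrative only and are not LipNet’s published configuration.

```python
# Rough, hypothetical sketch of a LipNet-style pipeline:
# 3D convolutions over video -> recurrent network -> per-frame outputs for CTC.
import torch
import torch.nn as nn

class LipReader(nn.Module):
    def __init__(self, vocab_size: int):
        super().__init__()
        self.frontend = nn.Sequential(            # spatiotemporal convs over (T, H, W)
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        self.gru = nn.GRU(input_size=32, hidden_size=256,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(512, vocab_size + 1)  # +1 for the CTC blank token

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, C, T, H, W)
        x = self.frontend(video)                  # (B, 32, T, H', W')
        x = x.mean(dim=(3, 4)).transpose(1, 2)    # pool space -> (B, T, 32)
        x, _ = self.gru(x)                        # (B, T, 512)
        return self.fc(x).log_softmax(-1)         # per-frame log-probs for CTC

# Training would pair these per-frame log-probs with nn.CTCLoss so that
# variable-length frame sequences can be aligned to text transcriptions.
```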
Pushing the boundaries of audiovisual word recognition using Residual Networks and LSTMs
- Computer Science, Comput. Vis. Image Underst.
- 2018
Toward movement-invariant automatic lip-reading and speech recognition
- Computer Science, 1995 International Conference on Acoustics, Speech, and Signal Processing
- 1995
We present the development of a modular system for flexible human-computer interaction via speech. The speech recognition component integrates acoustic and visual information (automatic lip-reading)…