Dynamic Temporal Alignment of Speech to Lips

@article{Halperin2019DynamicTA,
  title={Dynamic Temporal Alignment of Speech to Lips},
  author={Tavi Halperin and Ariel Ephrat and Shmuel Peleg},
  journal={ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2019},
  pages={3980-3984}
}
  • Tavi Halperin, A. Ephrat, Shmuel Peleg
  • Published 19 August 2018
  • Computer Science
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Many speech segments in movies are re-recorded in a studio during post-production, to compensate for poor sound quality as recorded on location. We present an audio-to-video method for automating speech to lips alignment, stretching and compressing the audio signal to match the lip movements. This alignment is based on deep audio-visual features, mapping the lips video and the speech signal to a shared representation. Using this representation we compute the lip-sync error between every short…
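The truncated abstract describes mapping short audio and video segments into a shared embedding space, scoring their lip-sync error, and then stretching or compressing the audio to match the lips. Below is a minimal sketch of that general idea, assuming hypothetical pre-computed embeddings and a plain dynamic-time-warping pass; the paper's actual features and alignment procedure may differ.

```python
import numpy as np

def dtw_align(audio_emb, video_emb):
    """Align audio frames to video frames by dynamic time warping over
    cosine distances in a (hypothetical) shared embedding space.

    audio_emb: (Ta, D) array, video_emb: (Tv, D) array.
    Returns the warping path as a list of (audio_index, video_index) pairs.
    """
    # Cosine distance matrix: low values mean good lip-sync for that pair.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    cost = 1.0 - a @ v.T                                    # (Ta, Tv)

    Ta, Tv = cost.shape
    acc = np.full((Ta + 1, Tv + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tv + 1):
            # Diagonal keeps pace; vertical/horizontal steps stretch or
            # compress the audio relative to the video.
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j - 1],
                                                 acc[i - 1, j],
                                                 acc[i, j - 1])

    # Backtrack the cheapest warping path.
    path, i, j = [], Ta, Tv
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Example with random embeddings standing in for real network outputs.
path = dtw_align(np.random.randn(200, 256), np.random.randn(180, 256))
```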


End to End Lip Synchronization with a Temporal AutoEncoder
  • Yoav Shalev, L. Wolf
  • Computer Science
  • 2020 IEEE Winter Conference on Applications of Computer Vision (WACV)
  • 2020
TLDR
This work studies the problem of syncing the lip movement in a video with the audio stream and finds an optimal alignment using a dual-domain recurrent neural network trained on synthetic data generated by dropping and duplicating video frames.
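The synthetic training data mentioned above is generated by randomly dropping and duplicating video frames. A hypothetical sketch of such a corruption step (function and parameter names are illustrative, not taken from the paper):

```python
import numpy as np

def corrupt_frame_timing(frames, p_drop=0.1, p_dup=0.1, seed=None):
    """Return a temporally corrupted copy of a frame sequence, made by
    randomly dropping and duplicating frames, plus the index map that a
    model could be trained to recover (illustrative only).
    """
    rng = np.random.default_rng(seed)
    out_frames, index_map = [], []
    for i, frame in enumerate(frames):
        r = rng.random()
        if r < p_drop:
            continue                       # drop this frame entirely
        out_frames.append(frame)
        index_map.append(i)
        if r > 1.0 - p_dup:
            out_frames.append(frame)       # duplicate this frame once
            index_map.append(i)
    return np.stack(out_frames), np.array(index_map)
```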
Audio-Visual Alignment Model with Neural Network
Synchronizing audio and video has always been a time-consuming and frustrating task for video editors. It often takes many hours and several rounds of re-editing back and forth to perfectly align the…
End-to-End Lip Synchronisation
TLDR
This work proposes an end-to-end trained network that can directly predict the offset between an audio stream and the corresponding video stream and demonstrates that the proposed approach outperforms the previous work by a large margin on LRS2 and LRS3 datasets.
Neural Dubber: Dubbing for Videos According to Scripts
  • Chenxu Hu, Qiao Tian, Tingle Li, Yuping Wang, Yuxuan Wang, Hang Zhao
  • Engineering, Computer Science
  • 2021
TLDR
Experiments show that Neural Dubber can generate speech audio on par with state-of-the-art TTS models in terms of speech quality, control the prosody of the synthesized speech using the video, and generate high-fidelity speech temporally synchronized with the video.
AlignNet: A Unifying Approach to Audio-Visual Alignment
TLDR
Qualitative, quantitative and subjective evaluation results on dance-music alignment and speech-lip alignment demonstrate that the AlignNet method far outperforms the state-of-the-art methods.
Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning
TLDR
A novel self-supervised framework with a co-attention mechanism is proposed to learn generic cross-modal representations from unlabelled videos in the wild and further benefit downstream tasks.
Resource-Adaptive Deep Learning for Visual Speech Recognition
TLDR
A novel recognition paradigm is proposed, called MultiRate Ensemble (MRE), that combines a "lean" and a "full" MobiLipNetV3 in the lipreading pipeline, with the latter applied at a lower frame rate, allowing adaptation to the available device resources.
Perfect Match: Improved Cross-modal Embeddings for Audio-visual Synchronisation
TLDR
A new strategy is proposed for learning powerful cross-modal embeddings for audio-to-video synchronisation, posed as a multi-way matching problem rather than the binary classification (matching or non-matching) problem used in recent papers.
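The multi-way matching idea can be illustrated with a short, hedged sketch: instead of a binary match/non-match decision, the model is trained to pick the one correctly synchronised audio segment among N temporal candidates via a softmax cross-entropy over embedding similarities. Tensor names and shapes below are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def multiway_matching_loss(video_emb, audio_candidates, true_index):
    """video_emb: (B, D) visual embeddings.
    audio_candidates: (B, N, D) audio embeddings at N candidate offsets,
    of which the one at true_index is correctly synchronised.
    Cross-entropy over similarities replaces a binary match/non-match loss.
    """
    sims = torch.einsum('bd,bnd->bn', video_emb, audio_candidates)   # (B, N)
    target = torch.full((video_emb.size(0),), true_index,
                        dtype=torch.long, device=video_emb.device)
    return F.cross_entropy(sims, target)
```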
MobiLipNet: Resource-Efficient Deep Learning Based Lipreading
TLDR
This paper investigates the MobileNet convolutional neural network architectures, recently proposed for image classification, and extends the 2D convolutions of MobileNets to 3D ones, in order to better model the spatio-temporal nature of the lipreading problem.
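A minimal PyTorch sketch of the 3D extension described above, assuming the standard MobileNet depthwise-separable pattern carried over to spatio-temporal inputs; layer sizes are illustrative, not the paper's.

```python
import torch.nn as nn

class DepthwiseSeparable3d(nn.Module):
    """3D analogue of MobileNet's depthwise-separable convolution: a
    per-channel spatio-temporal convolution followed by a 1x1x1 pointwise
    convolution (illustrative sizes, not the paper's configuration)."""
    def __init__(self, in_ch, out_ch, kernel=(3, 3, 3), stride=1):
        super().__init__()
        self.depthwise = nn.Conv3d(in_ch, in_ch, kernel, stride=stride,
                                   padding=tuple(k // 2 for k in kernel),
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv3d(in_ch, out_ch, 1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm3d(in_ch), nn.BatchNorm3d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                  # x: (B, C, T, H, W)
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))
```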
Drop-DTW: Aligning Common Signal Between Sequences While Dropping Outliers
TLDR
This work introduces Drop-DTW, a novel algorithm that aligns the common signal between the sequences while automatically dropping the outlier elements from the matching in order to address sequence-to-sequence alignment for signals containing outliers.
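A simplified sketch of the Drop-DTW idea: the standard DTW recursion is augmented with an explicit option to drop an element at a fixed penalty, so outliers need not be matched. This version only allows drops in the first sequence and is not the paper's exact dynamic program.

```python
import numpy as np

def drop_dtw_cost(cost, drop_cost):
    """cost: (Ta, Tb) pairwise match costs; drop_cost: penalty per dropped
    element of sequence A. Returns the total alignment cost when outlier
    elements of A may be skipped instead of matched (simplified sketch).
    """
    Ta, Tb = cost.shape
    acc = np.full((Ta + 1, Tb + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, Ta + 1):
        acc[i, 0] = acc[i - 1, 0] + drop_cost      # drop leading elements of A
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            match = cost[i - 1, j - 1] + min(acc[i - 1, j - 1],
                                             acc[i - 1, j],
                                             acc[i, j - 1])
            drop = acc[i - 1, j] + drop_cost       # skip A[i-1] entirely
            acc[i, j] = min(match, drop)
    return acc[Ta, Tb]
```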

References

SHOWING 1-10 OF 43 REFERENCES
Looking to listen at the cocktail party
TLDR
A deep network-based model that incorporates both visual and auditory signals to isolate a single speech signal from a mixture of sounds such as other speakers and background noise, showing a clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech.
Improved Speech Reconstruction from Silent Video
TLDR
This paper presents an end-to-end model based on a convolutional neural network for generating an intelligible and natural-sounding acoustic speech signal from silent video frames of a speaking person and shows promising results towards reconstructing speech from an unconstrained dictionary.
The Conversation: Deep Audio-Visual Speech Enhancement
TLDR
A deep audio-visual speech enhancement network that is able to separate a speaker's voice given lip regions in the corresponding video, by predicting both the magnitude and the phase of the target signal.
VDub: Modifying Face Video of Actors for Plausible Visual Alignment to a Dubbed Audio Track
TLDR
This paper builds on high-quality monocular capture of 3D facial performance, lighting and albedo of the dubbing and target actors, and uses audio analysis in combination with a space-time retrieval method to synthesize a new photo-realistically rendered and highly detailed 3D shape model of the mouth region to replace the target performance.
TCD-TIMIT: An Audio-Visual Corpus of Continuous Speech
TLDR
The creation of a new corpus designed for continuous audio-visual speech recognition research, TCD-TIMIT, which consists of high-quality audio and video footage of 62 speakers reading a total of 6913 phonetically rich sentences is detailed.
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
TLDR
It is argued that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation, and it is proposed to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned.
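The self-supervised objective described above, predicting whether video frames and audio are temporally aligned, can be sketched as a binary classification step over aligned and artificially shifted pairs. The model interface and shift scheme below are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def synchrony_step(model, video, audio, optimizer, max_shift=10):
    """One illustrative training step: a fused audio-visual network is
    trained to output 1 for temporally aligned (video, audio) pairs and
    0 for pairs whose audio has been shifted in time.
    video: (B, C, T, H, W), audio: (B, T_a, F_a)."""
    B = video.size(0)
    shift = int(torch.randint(1, max_shift + 1, (1,)))
    misaligned = torch.roll(audio, shifts=shift, dims=1)    # shift audio in time

    logits = model(torch.cat([video, video], dim=0),
                   torch.cat([audio, misaligned], dim=0)).squeeze(-1)  # (2B,)
    labels = torch.cat([torch.ones(B), torch.zeros(B)]).to(logits.device)

    loss = F.binary_cross_entropy_with_logits(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```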
Detecting audio-visual synchrony using deep neural networks
TLDR
This paper addresses the problem of automatically detecting whether the audio and visual speech modalities in frontal pose videos are synchronous or not, and investigates the use of deep neural networks (DNNs) for this purpose.
Video Rewrite: driving visual speech with audio
TLDR
Video Rewrite is the first facial-animation system to automate all the labeling and assembly tasks required to resync existing footage to a new soundtrack.
Automated lip-sync: Background and techniques
  • John Lewis
  • Computer Science
  • Comput. Animat. Virtual Worlds
  • 1991
TLDR
It is indicated that the automatic derivation of mouth movement from a speech soundtrack is a tractable problem, and a common speech synthesis method, linear prediction, is adapted to provide simple and accurate phoneme recognition.
Vid2speech: Speech reconstruction from silent video
  • A. Ephrat, Shmuel Peleg
  • Computer Science
  • 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2017
TLDR
It is shown that by leveraging the automatic feature learning capabilities of a CNN, the model can obtain state-of-the-art word intelligibility on the GRID dataset, and shows promising results for learning out-of-vocabulary (OOV) words.