Towards Automatic Face-to-Face Translation

@article{KR2019TowardsAF,
  title={Towards Automatic Face-to-Face Translation},
  author={Prajwal K R and Rudrabha Mukhopadhyay and Jerin Philip and Abhishek Jha and Vinay P. Namboodiri and C. V. Jawahar},
  journal={Proceedings of the 27th ACM International Conference on Multimedia},
  year={2019}
}
In light of the recent breakthroughs in automatic machine translation systems, we propose a novel approach that we term "Face-to-Face Translation". As today's digital communication becomes increasingly visual, we argue that there is a need for systems that can automatically translate a video of a person speaking in language A into a target language B with realistic lip synchronization. In this work, we create an automatic pipeline for this problem and demonstrate its impact in multiple real…
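The pipeline itself cascades existing components: speech recognition, machine translation, speech synthesis, and lip synchronization. A minimal sketch, assuming four pre-trained models supplied by the caller; the function and argument names below are illustrative placeholders, not the authors' released code.

def face_to_face_translate(frames, source_audio, src_lang, tgt_lang,
                           asr, translate, tts, lip_sync):
    """Speech recognition -> machine translation -> speech synthesis -> lip sync."""
    transcript = asr(source_audio, lang=src_lang)                    # speech (A) -> text (A)
    translated = translate(transcript, src=src_lang, tgt=tgt_lang)   # text (A) -> text (B)
    target_speech = tts(translated, lang=tgt_lang)                   # text (B) -> speech (B)
    synced_frames = lip_sync(frames, target_speech)                  # re-render lips to match speech (B)
    return synced_frames, target_speech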

Translating sign language videos to talking faces

This work improves existing sign language translation systems by using POS tags to strengthen language modeling and proposes a two-stage approach that translates sign language into intermediate text, followed by a language model to obtain the final predictions.

Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation

This paper proposes a clean yet effective framework for generating talking faces whose head poses can be controlled by other videos, with additional capabilities including extreme-view robustness and talking-face frontalization.

Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild

This work proposes a new VAE-GAN architecture that learns to associate the lip and speech sequences amidst the variations of silent lip videos, and learns to synthesize speech sequences in any voice for the lip movements of any person.

Audio-Visual Face Reenactment

This work proposes a novel method to generate realistic talking head videos using audio and visual streams. We animate a source image by transferring head motion from a driving video using a dense…

A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild

This work investigates the problem of lip-syncing a talking face video of an arbitrary identity to match a target speech segment, and identifies key reasons pertaining to this and hence resolves them by learning from a powerful lip-sync discriminator.
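The summary above hinges on penalizing the generator with a pre-trained lip-sync expert. A minimal sketch of such a loss, assuming a frozen expert that returns a sync probability for a (frames, mel-spectrogram) pair; the loss weighting and names are illustrative, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def lip_sync_generator_loss(generated_frames, target_frames, mel_chunks,
                            sync_expert, sync_weight=0.03):
    """Hypothetical sketch: pixel reconstruction plus a penalty from a frozen,
    pre-trained lip-sync expert that scores audio-video synchronization."""
    # L1 reconstruction against the ground-truth frames.
    recon_loss = F.l1_loss(generated_frames, target_frames)
    # The expert returns a sync probability in (0, 1); its weights are assumed
    # frozen, so only the generator receives gradients through generated_frames.
    sync_prob = sync_expert(generated_frames, mel_chunks).clamp(min=1e-7)
    sync_loss = -torch.log(sync_prob).mean()
    return recon_loss + sync_weight * sync_loss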

Parallel and High-Fidelity Text-to-Lip Generation

A parallel decoding model for fast and high-fidelity text-to-lip generation (ParaLip) is proposed that predicts the duration of the encoded linguistic features and models the target lip frames conditioned on the encoded linguistic features with their duration in a non-autoregressive manner, and incorporates the structural similarity index loss and adversarial learning.
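The non-autoregressive decoding rests on expanding the encoded linguistic features by their predicted durations before all lip frames are generated in parallel. A minimal length-regulator sketch under that assumption; the names and shapes are illustrative, not the paper's implementation.

import torch

def expand_by_duration(encoded, durations):
    """Repeat each encoded linguistic feature vector by its predicted duration
    (in lip frames), so the target frames can be decoded in parallel.

    encoded:   (num_tokens, hidden_dim) linguistic features
    durations: (num_tokens,) integer frame counts predicted per token
    """
    # torch.repeat_interleave expands along dim 0: token i is copied durations[i] times.
    return torch.repeat_interleave(encoded, durations, dim=0)

# Example: three tokens predicted to span 2, 4 and 3 lip frames respectively.
feats = torch.randn(3, 256)
frame_inputs = expand_by_duration(feats, torch.tensor([2, 4, 3]))   # shape (9, 256)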

Intelligent video editing: incorporating modern talking face generation algorithms in a video editor

A video editor based on OpenShot that adds several state-of-the-art facial video editing algorithms as functionalities, providing an easy-to-use interface for applying modern lip-syncing algorithms interactively and tackling the critical aspect of synchronizing background content with the translated speech.

High-Speed and High-Quality Text-to-Lip Generation

A novel parallel decoding model for high-speed and high-quality text-to-lip generation (HH-T2L) that incorporates the structural similarity index loss and adversarial learning to improve the perceptual quality of the generated lip frames and alleviate the blurry prediction problem.

More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech

Motivated by dubbing, VDTTS takes advantage of video frames as an additional input alongside text and generates speech that matches the video signal: the output not only has prosodic variations such as natural pauses and pitch, but is also synchronized to the input video.

Visual Speech Enhancement Without A Real Visual Stream

This work proposes a new paradigm for speech enhancement by exploiting recent breakthroughs in speech-driven lip synthesis, using one such model as a teacher network to produce accurate lip movements that mask away the noise, thus acting as a "visual noise filter".
...

References

Showing 1-10 of 35 references

Deep Audio-Visual Speech Recognition

This work compares two models for lip reading, one using a CTC loss and the other using a sequence-to-sequence loss, both built on top of the transformer self-attention architecture.

Talking Face Generation by Adversarially Disentangled Audio-Visual Representation

This work finds that the talking face sequence is actually a composition of both subject-related information and speech-related information, and learns disentangled audio-visual representation, which has an advantage where both audio and video can serve as inputs for generation.

Lip Movements Generation at a Glance

This paper devises a method to fuse audio and image embeddings to generate multiple lip images at once and proposes a novel correlation loss to synchronize lip changes with speech changes, significantly outperforming other state-of-the-art methods extended to this task.

Neural Machine Translation by Jointly Learning to Align and Translate

It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
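Concretely, the soft-search is the additive attention mechanism introduced in that paper: for decoder step i, each source annotation h_j receives an alignment score from the previous decoder state s_{i-1}, the scores are normalized with a softmax, and their weighted sum forms the context vector fed to the decoder:

e_{ij} = a(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j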

You said that?

An encoder-decoder CNN model is proposed that uses a joint embedding of the face and audio to generate synthesised talking face video frames and results of re-dubbing videos using speech from a different person are shown.

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

It is shown that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages, and is competitive with the transcription of human workers when benchmarked on standard datasets.

Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation

This work proposes a simple solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages using a shared wordpiece vocabulary, and introduces an artificial token at the beginning of the input sentence to specify the required target language.
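The artificial-token trick is simple enough to show directly. A sketch assuming target-language codes of the form "<2xx>"; the exact token format may differ from the released system.

def add_target_token(source_sentence: str, target_lang: str) -> str:
    """Prepend an artificial token naming the desired target language, so one
    shared model (with a shared wordpiece vocabulary) knows which language to
    produce; zero-shot pairs require no change to the model itself."""
    return "<2{}> {}".format(target_lang, source_sentence)

# "<2es>" asks the single multilingual model to translate into Spanish.
print(add_target_token("How are you?", "es"))   # -> <2es> How are you?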

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

GNMT, Google's Neural Machine Translation system, is presented, which attempts to address many of the weaknesses of conventional phrase-based translation systems and provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.

Rapid Adaptation of Neural Machine Translation to New Languages

This paper proposes methods based on starting with massively multilingual "seed models", which can be trained ahead of time, and then continuing training on data related to the low-resource language (LRL), leading to a novel, simple, yet effective method of "similar-language regularization".

Video Rewrite: driving visual speech with audio

Video Rewrite is the first facial-animation system to automate all the labeling and assembly tasks required to resync existing footage to a new soundtrack.