More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech

@article{Hassid2022MoreTW,
  title={More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech},
  author={Michael Hassid and Michelle Tadmor Ramanovich and Brendan Shillingford and Miaosen Wang and Ye Jia and Tal Remez},
  journal={2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022},
  pages={10577-10587}
}
In this paper we present VDTTS, a Visually-Driven Text-to-Speech model. Motivated by dubbing, VDTTS takes advantage of video frames as an additional input alongside text, and generates speech that matches the video signal. We demonstrate how this allows VDTTS to, unlike plain TTS models, generate speech that not only has prosodic variations like natural pauses and pitch, but is also synchronized to the input video. Experimentally, we show our model produces well-synchronized outputs…
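
The abstract describes the model's interface at a high level: text plus aligned video frames in, speech that matches the video out. The sketch below illustrates that interface only; the module names, feature dimensions, and the one-mel-frame-per-video-frame simplification are assumptions for illustration, not the authors' architecture.

# Minimal sketch of a visually-driven TTS interface; all names and sizes are
# illustrative assumptions, not the VDTTS implementation.
import torch
import torch.nn as nn

class VisuallyDrivenTTSSketch(nn.Module):
    def __init__(self, vocab_size=100, d_model=256, video_dim=512, n_mels=80):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)  # assumed pre-extracted per-frame face features
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, text_ids, video_feats):
        text = self.text_emb(text_ids)        # (batch, T_text, d_model)
        video = self.video_proj(video_feats)  # (batch, T_video, d_model)
        # Video frames act as queries over the text, so the output keeps the
        # video's length and can carry video-driven pauses and prosody.
        hidden = self.decoder(tgt=video, memory=text)
        return self.to_mel(hidden)            # (batch, T_video, n_mels)

# Example with random inputs: batch of 2, 20 text tokens, 50 video frames.
model = VisuallyDrivenTTSSketch()
mel = model(torch.randint(0, 100, (2, 20)), torch.randn(2, 50, 512))
print(mel.shape)  # torch.Size([2, 50, 80])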

Citations

A Deep Dive Into Neural Synchrony Evaluation for Audio-visual Translation

The agreement of SyncNet scores with human perception is assessed, as is whether these scores can be used as a reliable metric for evaluating audio-visual lip-synchrony in generation tasks with no ground-truth reference audio-video pair.
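
For context on what such a synchrony score measures: a SyncNet-style metric compares per-window video and audio embeddings across a range of temporal offsets and reports the best offset together with a confidence, commonly the gap between the median and minimum distance. The snippet below is a simplified sketch with placeholder embeddings, not the actual SyncNet implementation.

# Simplified sketch of a SyncNet-style offset/confidence score; the
# embeddings are random placeholders and the distance is plain L2.
import torch

def sync_offset_and_confidence(video_emb, audio_emb, max_offset=15):
    # video_emb, audio_emb: (T, d) embeddings, one per short window
    dists = []
    for off in range(-max_offset, max_offset + 1):
        # Compare only the overlapping region for this audio/video offset.
        if off >= 0:
            v, a = video_emb[off:], audio_emb[:len(audio_emb) - off]
        else:
            v, a = video_emb[:off], audio_emb[-off:]
        dists.append((v - a).norm(dim=1).mean())
    dists = torch.stack(dists)
    best_offset = int(torch.argmin(dists)) - max_offset
    confidence = float(dists.median() - dists.min())
    return best_offset, confidence

v = torch.nn.functional.normalize(torch.randn(50, 512), dim=1)
a = torch.nn.functional.normalize(torch.randn(50, 512), dim=1)
print(sync_offset_and_confidence(v, a))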

Merkel Podcast Corpus: A Multimodal Dataset Compiled from 16 Years of Angela Merkel’s Weekly Video Podcasts

It is argued that the Merkel Podcast Corpus, an audio-visual-text corpus in German collected from 16 years of (almost) weekly Internet podcasts of former German chancellor Angela Merkel, is a valuable contribution to the research community, in particular due to its realistic and challenging material at the boundary between prepared and spontaneous speech.

References

Showing 1-10 of 63 references

VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over

VisualTTS, a novel text-to-speech model conditioned on visual input for accurate lip-speech synchronization, is proposed and shown to outperform all baseline systems.

Lipper: Synthesizing Thy Speech using Multi-View Lipreading

A multi-view lipreading-to-audio system, namely Lipper, is proposed; it models speech reconstruction as a regression task and shows an improvement over single-view speech reconstruction results.

Neural Dubber: Dubbing for Silent Videos According to Scripts

Experiments show that Neural Dubber can generate speech on par with state-of-the-art TTS models in terms of speech quality, and both qualitative and quantitative evaluations show that it can control the prosody of the synthesized speech through the video and generate high-fidelity speech temporally synchronized with it.

Lip Reading Sentences in the Wild

The WLAS model trained on the LRS dataset surpasses the performance of all previous work on standard lip reading benchmark datasets, often by a significant margin, and it is demonstrated that if audio is available, then visual information helps to improve speech recognition performance.

Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis

This work proposes a novel approach with key design choices to achieve accurate, natural lip-to-speech synthesis in unconstrained scenarios for the first time, and shows that its method is four times more intelligible than previous works in this space.

Large-scale multilingual audio visual dubbing

An architectural overview of the full system for large-scale audiovisual translation and dubbing is given, as well as an in-depth discussion of the video dubbing component.

Large-Scale Visual Speech Recognition

This work designed and trained an integrated lipreading system consisting of a video processing pipeline that maps raw video to stable videos of lips and sequences of phonemes, a scalable deep neural network that maps the lip videos to sequences of phoneme distributions, and a production-level speech decoder that outputs sequences of words.

FastSpeech: Fast, Robust and Controllable Text to Speech

A novel feed-forward network based on the Transformer, called FastSpeech, is proposed to generate mel-spectrograms in parallel for TTS; it speeds up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x.
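
The parallel generation hinges on a length regulator: each phoneme encoding is repeated according to a predicted duration, so the decoder can produce all mel frames at once instead of autoregressively. A minimal sketch of that expansion step, with made-up durations:

# Length-regulator sketch: expand phoneme encodings by per-phoneme durations.
import torch

def length_regulate(phoneme_enc, durations):
    # phoneme_enc: (T_phon, d); durations: (T_phon,) predicted frame counts.
    # Returns (sum(durations), d), one row per output mel frame.
    return torch.repeat_interleave(phoneme_enc, durations, dim=0)

enc = torch.randn(4, 8)            # 4 phonemes, 8-dim encodings
durs = torch.tensor([3, 1, 4, 2])  # illustrative predicted durations
print(length_regulate(enc, durs).shape)  # torch.Size([10, 8])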

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

"global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system, learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.

Tacotron: Towards End-to-End Speech Synthesis

Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
...