Corpus ID: 235367982

NWT: Towards natural audio-to-video generation with representation learning

@article{Mama2021NWTTN,
  title={NWT: Towards natural audio-to-video generation with representation learning},
  author={Rayhane Mama and M. S. Tyndel and Hashiam Kadhim and Cole Clifford and Ragavan Thurairatnam},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.04283}
}
In this work we introduce NWT, an expressive speech-to-video model. Unlike approaches that use domain-specific intermediate representations such as pose keypoints, NWT learns its own latent representations, with minimal assumptions about the audio and video content. To this end, we propose a novel discrete variational autoencoder with adversarial loss, dVAE-Adv, which learns a new discrete latent representation we call Memcodes. Memcodes are straightforward to implement, require no additional…
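The abstract does not spell out how Memcodes are computed. As a purely hypothetical sketch of the discrete-latent-autoencoder family (VQ-VAE-style nearest-neighbour codebook lookup) that dVAE-Adv builds on, the core quantization step might look like the following; the function name, shapes, and codebook are illustrative assumptions, not the paper's actual method:

```python
import numpy as np

def nearest_code(z, codebook):
    """Quantize each continuous latent vector to its nearest codebook
    entry (VQ-VAE-style discrete bottleneck).

    z        : (n, d) encoder outputs
    codebook : (k, d) learned discrete embeddings
    Returns the quantized vectors and the chosen code indices.
    """
    # Squared Euclidean distance from every latent to every code entry
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)          # index of the closest code per latent
    return codebook[idx], idx
```

In training, gradients are typically passed through this non-differentiable lookup with a straight-through estimator; the Memcodes formulation in the paper differs in its details.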

References

Showing 1–10 of 65 references
Generating Diverse High-Fidelity Images with VQ-VAE-2
It is demonstrated that a multi-scale hierarchical organization of VQ-VAE, augmented with powerful priors over the latent codes, is able to generate samples with quality that rivals that of state-of-the-art Generative Adversarial Networks on multifaceted datasets such as ImageNet, while not suffering from GANs' known shortcomings such as mode collapse and lack of diversity.
MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
The model is non-autoregressive and fully convolutional, has significantly fewer parameters than competing models, and generalizes to unseen speakers for mel-spectrogram inversion; the authors also suggest a set of guidelines for designing general-purpose discriminators and generators for conditional sequence synthesis tasks.
Adversarial Video Generation on Complex Datasets
This work shows that large Generative Adversarial Networks trained on the complex Kinetics-600 dataset are able to produce video samples of substantially higher complexity and fidelity than previous work.
GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions
GODIVA is proposed, an open-domain text-to-video pretrained model that can generate videos from text in an auto-regressive manner using a three-dimensional sparse attention mechanism, along with a new metric called Relative Matching to automatically evaluate video generation quality.
Neural Discrete Representation Learning
Pairing these representations with an autoregressive prior, the model can generate high-quality images, videos, and speech, as well as performing high-quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.
Improved Variational Autoencoders for Text Modeling using Dilated Convolutions
It is shown that with the right decoder, VAEs can outperform LSTM language models; perplexity gains are demonstrated on two datasets, representing the first positive experimental result on the use of VAEs for generative modeling of text.
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
"Global style tokens" (GSTs), a bank of embeddings jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system, learn to factorize noise and speaker identity, providing a path towards highly scalable yet robust speech synthesis.
LumièreNet: Lecture Video Synthesis from Audio
We present LumièreNet, a simple, modular, and completely deep-learning-based architecture that synthesizes high-quality, full-pose headshot lecture videos from an instructor's new audio narration of…
Unsupervised Speech Representation Learning Using WaveNet Autoencoders
A regularization scheme is introduced that forces the representations to focus on the phonetic content of the utterance, and performance comparable with the top entries in the ZeroSpeech 2017 unsupervised acoustic unit discovery task is reported.
Improving variational autoencoder with deep feature consistent and generative adversarial training
A generative adversarial training mechanism forces the variational autoencoder (VAE) to output realistic and natural images, and a multi-view feature extraction strategy extracts effective image representations that achieve state-of-the-art performance in facial attribute prediction.
...