Corpus ID: 227228485

Adaptive Compact Attention For Few-shot Video-to-video Translation

@article{Huang2020AdaptiveCA,
  title={Adaptive Compact Attention For Few-shot Video-to-video Translation},
  author={Risheng Huang and Li Shen and X. Wang and Chu-Hsing Lin and Haozhi Huang},
  journal={ArXiv},
  year={2020},
  volume={abs/2011.14695}
}
This paper proposes an adaptive compact attention model for few-shot video-to-video translation. Existing works in this domain use features only from pixel-wise attention, without considering the correlations among multiple reference images, which leads to heavy computation but limited performance. We therefore introduce a novel adaptive compact attention mechanism to efficiently extract contextual features jointly from multiple reference images, whose encoded view-dependent and motion…
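The abstract's core idea — attending to a compact summary of several reference images rather than to every reference pixel — can be illustrated with a minimal PyTorch-style sketch. Everything below is an illustrative assumption (including the pooling-based compaction and all names), not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def compact_attention(query, ref_feats, num_bases=64):
    """Hypothetical sketch of compact attention over multiple references.
    query:     (N, Lq, C) features of the current/driving frame.
    ref_feats: (N, R, Lr, C) features from R reference images.
    Instead of attending to all R * Lr reference positions (pixel-wise
    attention), summarize them into a small set of context vectors."""
    N, R, Lr, C = ref_feats.shape
    refs = ref_feats.reshape(N, R * Lr, C)           # treat references jointly
    # Compact step (assumption): adaptive pooling down to num_bases vectors.
    bases = F.adaptive_avg_pool1d(refs.transpose(1, 2), num_bases)
    bases = bases.transpose(1, 2)                    # (N, K, C), K << R * Lr
    # Standard scaled dot-product attention against the compact bases.
    attn = F.softmax(query @ bases.transpose(1, 2) / C ** 0.5, dim=2)
    return attn @ bases                              # (N, Lq, C) context
```

As a rough sense of scale: with R = 4 reference frames at 64x64 feature resolution, pixel-wise attention compares each query position against 4 * 4096 reference positions, while the compact variant attends to only num_bases vectors.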


References

Showing 1-10 of 35 references
Few-shot Video-to-Video Synthesis
TLDR
A few-shot vid2vid framework is proposed that learns to synthesize videos of previously unseen subjects or scenes from only a few example images of the target at test time, using a novel network weight generation module built on an attention mechanism.
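The weight-generation idea summarized above — example images of the target are mapped, via attention, to parameters of the synthesis network — can be sketched as follows. The module structure and all names are illustrative assumptions, not the released few-shot vid2vid code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightGenerator(nn.Module):
    """Hypothetical sketch: attend over K example-image embeddings
    (keyed by their poses) and map the result to conv-layer weights."""
    def __init__(self, feat_dim, weight_numel):
        super().__init__()
        self.to_weights = nn.Linear(feat_dim, weight_numel)

    def forward(self, query_pose, example_poses, example_feats):
        # query_pose:    (N, C)    embedding of the target pose
        # example_poses: (N, K, C) embeddings of the example images' poses
        # example_feats: (N, K, C) appearance features of the examples
        attn = F.softmax(
            (query_pose.unsqueeze(1) * example_poses).sum(-1), dim=1)  # (N, K)
        context = (attn.unsqueeze(-1) * example_feats).sum(1)          # (N, C)
        return self.to_weights(context)  # flattened weights for a conv layer
```

The design point is that the synthesis network's weights are predicted from the examples at test time, so no fine-tuning on the new subject is required.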
Learning to Forecast and Refine Residual Motion for Image-to-Video Generation
TLDR
This work trains networks to learn the residual motion between the current and future frames, which avoids learning motion-irrelevant details, and proposes a two-stage generation framework in which videos are first generated from structures and then refined by temporal signals.
Flow-Grounded Spatial-Temporal Video Prediction from Still Images
TLDR
This work formulates the multi-frame prediction task as a multiple-time-step flow (multi-flow) prediction phase followed by a flow-to-frame synthesis phase, which keeps the model from directly operating in the high-dimensional pixel space of the frame sequence and is demonstrated to produce better and more diverse predictions.
Video-to-Video Synthesis
TLDR
This paper proposes a novel video-to-video synthesis approach under the generative adversarial learning framework, capable of synthesizing 2K-resolution videos of street scenes up to 30 seconds long, which significantly advances the state of the art in video synthesis.
FutureGAN: Anticipating the Future Frames of Video Sequences Using Spatio-Temporal 3D Convolutions in Progressively Growing GANs
  • Sandra Aigner, Marco Korner
  • Computer Science
  • The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
  • 2019
TLDR
A new encoder-decoder GAN model that predicts future frames of a video sequence conditioned on a sequence of past frames; it is applicable to various datasets without additional changes while achieving stable results that are competitive with the state of the art in video prediction.
Deep multi-scale video prediction beyond mean square error
TLDR
This work trains a convolutional network to generate future frames given an input sequence and proposes three different and complementary feature learning strategies: a multi-scale architecture, an adversarial training method, and an image gradient difference loss function.
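The image gradient difference loss named in this summary penalizes mismatches between the spatial gradients of the predicted and ground-truth frames. A minimal PyTorch-style sketch (the function name and the alpha default are illustrative assumptions):

```python
import torch

def gradient_difference_loss(pred, target, alpha=1.0):
    """Image gradient difference loss: compares the absolute spatial
    gradients of predicted and ground-truth frames.
    pred, target: (N, C, H, W) tensors."""
    # Vertical gradients (differences between adjacent rows).
    pred_dy = (pred[:, :, 1:, :] - pred[:, :, :-1, :]).abs()
    target_dy = (target[:, :, 1:, :] - target[:, :, :-1, :]).abs()
    # Horizontal gradients (differences between adjacent columns).
    pred_dx = (pred[:, :, :, 1:] - pred[:, :, :, :-1]).abs()
    target_dx = (target[:, :, :, 1:] - target[:, :, :, :-1]).abs()
    # Penalize the mismatch between gradient magnitudes.
    return ((pred_dy - target_dy).abs() ** alpha).mean() + \
           ((pred_dx - target_dx).abs() ** alpha).mean()
```

Combined with an adversarial term, this sharpens predictions relative to a pure mean-square-error objective, which tends to average plausible futures into blur.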
Expectation-Maximization Attention Networks for Semantic Segmentation
TLDR
This paper casts the attention mechanism in an expectation-maximization manner, iteratively estimating a much more compact set of bases upon which the attention maps are computed; the resulting module is robust to variance in the input and is also friendly in memory and computation.
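The iterative basis estimation described above can be sketched as alternating E- and M-steps over a small dictionary of bases. This is an illustrative re-implementation of the idea, not the authors' code:

```python
import torch
import torch.nn.functional as F

def em_attention(x, mu, iters=3):
    """Expectation-maximization attention sketch (after EMANet).
    x:  (N, L, C) flattened pixel features.
    mu: (N, K, C) initial bases, with K << L."""
    for _ in range(iters):
        # E-step: soft responsibilities of each pixel for each basis.
        z = F.softmax(x @ mu.transpose(1, 2), dim=2)        # (N, L, K)
        # M-step: update bases as responsibility-weighted means.
        z_norm = z / (z.sum(dim=1, keepdim=True) + 1e-6)    # normalize over pixels
        mu = z_norm.transpose(1, 2) @ x                     # (N, K, C)
        mu = F.normalize(mu, dim=2)                         # l2-normalize bases
    # Re-estimation: reconstruct features from the compact bases.
    return z @ mu                                            # (N, L, C)
```

Because K is much smaller than the number of pixels L, the attention map and the updates cost O(L * K) rather than the O(L^2) of standard self-attention, which is the "compact" property this reference shares with the main paper.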
ContextVP: Fully Context-Aware Video Prediction
TLDR
This work introduces a fully context-aware architecture that captures the entire available past context for each pixel using parallel multi-dimensional LSTM units and aggregates it using blending units, yielding state-of-the-art performance for next-step prediction on three challenging real-world video datasets.
DensePose: Dense Human Pose Estimation in the Wild
TLDR
This work establishes dense correspondences between an RGB image and a surface-based representation of the human body, a task referred to as dense human pose estimation, and improves accuracy through cascading, obtaining a system that delivers highly accurate results at multiple frames per second on a single GPU.