Corpus ID: 233231721

Decoupled Spatial-Temporal Transformer for Video Inpainting

@article{Liu2021DecoupledST,
  title={Decoupled Spatial-Temporal Transformer for Video Inpainting},
  author={Rui Liu and Hanming Deng and Yangyi Huang and Xiaoyu Shi and Lewei Lu and Wenxiu Sun and Xiaogang Wang and Jifeng Dai and Hongsheng Li},
  year={2021},
}
Video inpainting aims to fill given spatiotemporal holes with realistic content, but it remains a challenging task even with prosperous deep learning approaches. Recent works introduce the promising Transformer architecture into deep video inpainting and achieve better performance. However, they still suffer from blurry synthesized textures as well as high computational cost. To this end, we propose a novel Decoupled Spatial-Temporal Transformer (DSTT) for improving video inpainting… 
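The decoupling idea, splitting one joint spatio-temporal attention into a temporal pass (attending across frames at each spatial position) and a spatial pass (attending within each frame), can be sketched in pure Python. This is an illustrative single-head toy under assumed token shapes, not the paper's actual implementation:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(tokens):
    """Single-head scaled dot-product self-attention over a list of
    C-dimensional token vectors (queries = keys = values = tokens)."""
    c = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(c)
                  for k in tokens]
        w = softmax(scores)
        out.append([sum(wj * kv[i] for wj, kv in zip(w, tokens))
                    for i in range(c)])
    return out

def temporal_attention(video):
    """Temporally-decoupled pass: attend across the T frames at each
    fixed spatial position, costing O(N * T^2)."""
    T, N = len(video), len(video[0])
    out = [[None] * N for _ in range(T)]
    for n in range(N):
        col = attention([video[t][n] for t in range(T)])
        for t in range(T):
            out[t][n] = col[t]
    return out

def spatial_attention(video):
    """Spatially-decoupled pass: attend across the N positions within
    each frame independently, costing O(T * N^2)."""
    return [attention(frame) for frame in video]
```

For T frames of N tokens each, chaining the two passes costs O(N·T²) + O(T·N²) rather than the O((T·N)²) of full joint spatio-temporal attention, which is the source of the efficiency gain the abstract alludes to.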


Spatial-Temporal Residual Aggregation for High Resolution Video Inpainting
STRA-Net is proposed, a novel spatial-temporal residual aggregation framework for high resolution video inpainting that produces more temporally coherent and visually appealing results than state-of-the-art methods on inpainting high resolution videos.
A Temporal Learning Approach to Inpainting Endoscopic Specularities and Its Effect on Image Correspondence
This paper proposes using a temporal generative adversarial network (GAN) to inpaint the hidden anatomy under specularities, inferring its appearance spatially and from neighbouring frames in which the specularities do not occur at the same location.
Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based Motion Recognition
The seamless combination of these novel designs forms a robust spatial-temporal representation and achieves better performance than state-of-the-art methods on four public motion datasets.
Visual Attention Network
A novel large kernel attention (LKA) module is proposed to enable self-adaptive and long-range correlations in self-attention while avoiding the above issues and a novel neural network based on LKA is introduced, namely Visual Attention Network (VAN).
Attention Mechanisms in Computer Vision: A Survey
This survey provides a comprehensive review of various attention mechanisms in computer vision and categorizes them according to approach, such as channel attention, spatial attention, temporal attention, and branch attention.
MABC-EPF: Video in-painting technique with enhanced priority function and optimal patch search algorithm
The proposed video in-painting technique with an enhanced priority function and optimal patch-search algorithm works better than other state-of-the-art approaches, attaining peak signal-to-noise ratio (PSNR) improvements of 8.06%, 7.90%, 32.15%, and 13.06% over previous methods.
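Several of the works listed here report gains in peak signal-to-noise ratio (PSNR), which is computed from the mean squared error between a reconstructed frame and its reference. A minimal sketch in pure Python (`psnr` is a hypothetical helper, not code from any cited paper):

```python
import math

def psnr(reference, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized
    images given as flat lists of pixel intensities.
    PSNR = 10 * log10(peak^2 / MSE)."""
    mse = sum((r - x) ** 2
              for r, x in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images: infinite PSNR
    return 10.0 * math.log10(peak ** 2 / mse)
```

By convention `peak` is 255 for 8-bit images; higher PSNR means the inpainted frame is closer to the ground truth.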
Towards An End-to-End Framework for Flow-Guided Video Inpainting
This work proposes an end-to-end framework for flow-guided video inpainting (E2FGVI) with three elaborately designed trainable modules, namely flow completion, feature propagation, and content hallucination, which can be jointly optimized, leading to a more efficient and effective inpainting process.
Learning Joint Spatial-Temporal Transformations for Video Inpainting
This paper simultaneously fills missing regions in all input frames by self-attention and proposes to optimize the resulting STTN with a spatial-temporal adversarial loss, demonstrating the superiority of the proposed model.
Deep Video Inpainting
This work proposes a novel deep network architecture for fast video inpainting built upon an image-based encoder-decoder model that is designed to collect and refine information from neighbor frames and synthesize still-unknown regions.
Deep Flow-Guided Video Inpainting
This work first synthesizes a spatially and temporally coherent optical flow field across video frames using a newly designed Deep Flow Completion network, then uses the synthesized flow fields to guide the propagation of pixels to fill up the missing regions in the video.
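The propagation step described above can be illustrated with a 1-D toy: each masked pixel follows a per-pixel flow vector into a neighbouring frame and copies the value found there. This is a hedged sketch (the function name and the integer-flow simplification are assumptions), not the Deep Flow Completion pipeline itself:

```python
def flow_guided_fill(frame, mask, neighbor, flow):
    """Fill masked pixels of `frame` by following per-pixel flow
    vectors into `neighbor`. 1-D toy version: `flow[x]` is an integer
    horizontal offset from position x in `frame` to its correspondence
    in `neighbor`. Pixels whose flow lands outside the neighbor stay
    unfilled (None)."""
    out = list(frame)
    for x, missing in enumerate(mask):
        if not missing:
            continue  # keep known pixels untouched
        src = x + flow[x]
        out[x] = neighbor[src] if 0 <= src < len(neighbor) else None
    return out
```

In the real method the flow field is estimated and completed by a network, sampling is sub-pixel with bilinear interpolation, and pixels that cannot be reached from any frame are typically handed to an image inpainting network.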
Learnable Gated Temporal Shift Module for Deep Video Inpainting
This paper presents a novel component termed the Learnable Gated Temporal Shift Module (LGTSM) for video inpainting models, which effectively tackles arbitrary video masks without the additional parameters of 3D convolutions.
Video Inpainting by Jointly Learning Temporal Structure and Spatial Details
A novel deep learning architecture is proposed which contains two subnetworks, a temporal structure inference network and a spatial detail recovering network; both subnetworks are jointly trained in an end-to-end manner.
Copy-and-Paste Networks for Deep Video Inpainting
A novel DNN-based framework called the Copy-and-Paste Networks for video inpainting that takes advantage of additional information in other frames of the video; as an application, the restored videos significantly improve lane detection accuracy on road videos.
An Internal Learning Approach to Video Inpainting
We propose a novel video inpainting algorithm that simultaneously hallucinates missing appearance and motion (optical flow) information, building upon the recent 'Deep Image Prior' (DIP), which exploits the structure of a convolutional network itself as a prior for image restoration.
Generative Image Inpainting with Contextual Attention
This work proposes a new deep generative model-based approach which can not only synthesize novel image structures but also explicitly utilize surrounding image features as references during network training to make better predictions.
Proposal-Based Video Completion
This paper uses 3D convolutions to obtain an initial inpainting estimate, which is subsequently refined by fusing a generated set of proposals; these proposals provide a rich source of information that permits combining similar-looking patches that may be spatially and temporally far from the region to be inpainted.
Learning Blind Video Temporal Consistency
An efficient approach based on a deep recurrent network for enforcing temporal consistency in a video that can handle multiple and unseen tasks, including but not limited to artistic style transfer, enhancement, colorization, image-to-image translation and intrinsic image decomposition.