• Corpus ID: 239886012

CATs: Cost Aggregation Transformers for Visual Correspondence

Seokju Cho, Sunghwan Hong, Sangryul Jeon, Yunsung Lee, Kwanghoon Sohn, Seungryong Kim
We propose a novel cost aggregation network, called Cost Aggregation Transformers (CATs), to find dense correspondences between semantically similar images, a task made harder by large intra-class appearance and geometric variations. Cost aggregation is a highly important process in matching tasks, since matching accuracy depends on the quality of its output. Compared to handcrafted or CNN-based methods for cost aggregation, which either lack robustness to severe… 
CATs++: Boosting Cost Aggregation with Convolutions and Transformers
CATs++, an extension of CATs, introduces early convolutions before transformer-based cost aggregation to control the number of tokens and to inject convolutional inductive bias, and proposes a novel transformer architecture for efficient and effective cost aggregation, yielding a clear performance boost and cost reduction.
Cost Aggregation Is All You Need for Few-Shot Segmentation
We introduce a novel cost aggregation network, dubbed Volumetric Aggregation with Transformers (VAT), to tackle the few-shot segmentation task by using both convolutions and transformers to
Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation
A novel cost aggregation network, called Volumetric Aggregation with Transformers (VAT), for few-shot segmentation, where a high-dimensional Swin Transformer is preceded by a series of small-kernel convolutions that impart local context to all pixels and introduce convolutional inductive bias.
GAN-Supervised Dense Visual Alignment
GANgealing significantly outperforms past self-supervised correspondence algorithms and performs on-par with state-of-the-art supervised correspondence algorithms on several datasets—without making use of any correspondence supervision or data augmentation and despite being trained exclusively on GAN-generated data.
AiATrack: Attention in Attention for Transformer Visual Tracking
This work proposes an attention in attention (AiA) module, which enhances appropriate correlations and suppresses erroneous ones by seeking consensus among all correlation vectors, and proposes a streamlined Transformer tracking framework, dubbed AiATrack, by introducing efficient feature reuse and target-background embeddings to make full use of temporal references.
HMFS: Hybrid Masking for Few-Shot Segmentation
This work compensates for the loss of fine-grained spatial details in the FM technique by investigating and leveraging a complementary basic input masking method, which improves on the current state-of-the-art methods by visible margins across different benchmarks.
Rewriting geometric rules of a GAN
This work enables a user to "warp" a given model by editing just a handful of original model outputs with desired geometric changes, enabling the creation of a new generative model without the burden of curating a large-scale dataset.
FlowFormer: A Transformer Architecture for Optical Flow
We introduce Optical Flow TransFormer (FlowFormer), a transformer-based neural network architecture for learning optical flow. FlowFormer tokenizes the 4D cost volume built from an image pair,
Demystifying Unsupervised Semantic Correspondence Estimation
A new unsupervised correspondence approach is introduced which utilizes the strength of pre-trained features while encouraging better matches during training, which results in significantly better matching performance compared to current state-of-the-art methods.
Examining Responsibility and Deliberation in AI Impact Statements and Ethics Reviews
The artificial intelligence research community is continuing to grapple with the ethics of its work by encouraging researchers to discuss potential positive and negative consequences. Neural


Universal Correspondence Network
A convolutional spatial transformer to mimic patch normalization in traditional features like SIFT is proposed, which is shown to dramatically boost accuracy for semantic correspondences across intra-class shape variations.
Correspondence Networks With Adaptive Neighbourhood Consensus
This paper proposes a convolutional neural network architecture, called adaptive neighbourhood consensus network (ANC-Net), that can be trained end-to-end with sparse key-point annotations, to handle the task of establishing dense visual correspondences between images containing objects of the same category.
SFNet: Learning Object-Aware Semantic Correspondence
A new CNN architecture is proposed, dubbed SFNet, which leverages a new and differentiable version of the argmax function for end-to-end training, with a loss that combines mask and flow consistency with smoothness terms.
Learning to Compose Hypercolumns for Visual Correspondence
A novel approach to visual correspondence is introduced that dynamically composes effective features by selecting a small number of relevant layers from a deep convolutional neural network, conditioned on the images to match.
PARN: Pyramidal Affine Regression Networks for Dense Semantic Correspondence
A deep architecture for dense semantic correspondence, called pyramidal affine regression networks (PARN), that estimates locally-varying affine transformation fields across images and proposes a novel weakly-supervised training scheme that generates progressive supervisions by leveraging a correspondence consistency across image pairs.
FCSS: Fully Convolutional Self-Similarity for Dense Semantic Correspondence
To robustly match points among different instances within the same object class, FCSS is formulated using local self-similarity (LSS) within a fully convolutional network, which is inherently insensitive to intra-class appearance variations because of its LSS-based structure.
Dynamic Context Correspondence Network for Semantic Alignment
This paper proposes a context-aware semantic representation that incorporates spatial layout for robust matching against local ambiguities and develops a novel dynamic fusion strategy based on attention mechanism to weave the advantages of both local and context features by integrating semantic cues from multiple scales.
FCSS: Fully Convolutional Self-Similarity for Dense Semantic Correspondence
This work proposes to leverage object candidate priors provided in most existing datasets and also correspondence consistency between object pairs to enable weakly-supervised learning and significantly outperforms conventional handcrafted descriptors and CNN-based descriptors on various benchmarks.
Neighbourhood Consensus Networks
An end-to-end trainable convolutional neural network architecture that identifies sets of spatially consistent matches by analyzing neighbourhood consensus patterns in the 4D space of all possible correspondences between a pair of images without the need for a global geometric model is developed.
Efficient Neighbourhood Consensus Networks via Submanifold Sparse Convolutions
The proposed Sparse-NCNet method obtains state-of-the-art results on the HPatches Sequences and InLoc visual localisation benchmarks, and competitive results in the Aachen Day-Night benchmark.
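
Several entries above refer to a 4D correlation (cost) volume over all possible correspondences between an image pair, and to a differentiable version of the argmax function. The following is a minimal NumPy sketch of those two ideas only, not the implementation of any listed paper. The function names `correlation_volume` and `soft_argmax`, the use of cosine similarity, and the `temperature` value are assumptions made for illustration.

```python
import numpy as np


def correlation_volume(feat_src, feat_tgt):
    """Cosine-similarity 4D correlation volume (illustrative assumption).

    feat_src: (C, Hs, Ws) feature map of the source image.
    feat_tgt: (C, Ht, Wt) feature map of the target image.
    Returns an array of shape (Hs, Ws, Ht, Wt).
    """
    # L2-normalise along the channel axis so dot products become cosines.
    src = feat_src / (np.linalg.norm(feat_src, axis=0, keepdims=True) + 1e-8)
    tgt = feat_tgt / (np.linalg.norm(feat_tgt, axis=0, keepdims=True) + 1e-8)
    # Score every source position (i, j) against every target position (k, l).
    return np.einsum("cij,ckl->ijkl", src, tgt)


def soft_argmax(corr, temperature=0.02):
    """Softmax-weighted expected target coordinate for each source pixel.

    A differentiable stand-in for argmax; the temperature is a hypothetical
    choice. Returns an array of shape (Hs, Ws, 2) holding (row, col).
    """
    hs, ws, ht, wt = corr.shape
    logits = corr.reshape(hs, ws, ht * wt) / temperature
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob = prob / prob.sum(axis=-1, keepdims=True)
    rows, cols = np.meshgrid(np.arange(ht), np.arange(wt), indexing="ij")
    grid = np.stack([rows.ravel(), cols.ravel()], axis=-1).astype(float)
    return prob @ grid  # expectation over all target positions
```

Matching a feature map against itself should send each pixel to its own coordinates, which gives a quick sanity check. Cost aggregation methods such as those listed here would refine the raw volume before the coordinates are read out.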