Corpus ID: 238531533

Efficient large-scale image retrieval with deep feature orthogonality and Hybrid-Swin-Transformers

@article{Henkel2021Efficient,
  title={Efficient large-scale image retrieval with deep feature orthogonality and Hybrid-Swin-Transformers},
  author={Christof Henkel},
  year={2021}
}
We present an efficient end-to-end pipeline for large-scale landmark recognition and retrieval. We show how to combine and enhance concepts from recent research in image retrieval and introduce two architectures especially suited for large-scale landmark identification: a model with deep orthogonal fusion of local and global features (DOLG) using an EfficientNet backbone, and a novel Hybrid-Swin-Transformer. We discuss both architectures and detail how to train them efficiently using a step… 
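The orthogonal fusion at the heart of DOLG decomposes each local feature into a component parallel to the global descriptor and a component orthogonal to it, and keeps only the orthogonal part as complementary information. A minimal sketch of that projection step (pure Python with hypothetical helper names, not the paper's implementation):

```python
# Sketch of DOLG-style orthogonal fusion: strip from a local feature the
# component parallel to the global descriptor, keeping the remainder,
# which is orthogonal to it and thus carries complementary information.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthogonal_component(f_local, f_global):
    # projection of f_local onto f_global: (f_l . f_g / ||f_g||^2) * f_g
    scale = dot(f_local, f_global) / dot(f_global, f_global)
    proj = [scale * g for g in f_global]
    # orthogonal remainder: f_l - proj
    return [l - p for l, p in zip(f_local, proj)]

f_g = [1.0, 0.0, 1.0]                      # toy global descriptor
f_l = [2.0, 3.0, 0.0]                      # toy local feature
f_orth = orthogonal_component(f_l, f_g)    # [1.0, 3.0, -1.0]
```

In DOLG the orthogonal components are pooled and concatenated with the global descriptor before the final embedding layer; by construction the retained part satisfies `dot(f_orth, f_g) == 0`.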



Unifying Deep Local and Global Features for Efficient Image Search
This work unifies global and local image features in a single deep model, enabling scalable retrieval with high accuracy, and introduces an autoencoder-based dimensionality reduction technique for local features that is integrated into the model, improving training efficiency and matching performance.
Large-Scale Image Retrieval with Attentive Deep Local Features
An attentive local feature descriptor suitable for large-scale image retrieval, referred to as DELF (DEep Local Feature), based on convolutional neural networks trained only with image-level annotations on a landmark image dataset.
DOLG: Single-Stage Image Retrieval with Deep Orthogonal Fusion of Local and Global Features
This paper proposes a Deep Orthogonal Local and Global (DOLG) information fusion framework for end-to-end image retrieval that achieves state-of-the-art image retrieval performances on Revisited Oxford and Paris datasets.
Large-scale Landmark Retrieval/Recognition under a Noisy and Diverse Dataset
This work presents a novel landmark retrieval/recognition system, robust to a noisy and diverse dataset, based on deep convolutional neural networks with metric learning, trained with cosine-softmax-based losses.
Fine-Tuning CNN Image Retrieval with No Human Annotation
It is shown that both hard-positive and hard-negative examples, selected by exploiting the geometry and the camera positions available from the 3D models, enhance the performance of particular-object retrieval.
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
A new vision Transformer is presented that capably serves as a general-purpose backbone for computer vision and has the flexibility to model at various scales and has linear computational complexity with respect to image size.
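The shifted-window scheme behind Swin's linear complexity can be illustrated on a toy token grid: self-attention is restricted to non-overlapping M×M windows, and alternate blocks cyclically shift the grid so that information flows across window borders. A toy sketch (illustrative only, not the paper's implementation):

```python
# Toy illustration of Swin-style window partitioning with a cyclic shift.
def cyclic_shift(grid, s):
    """Roll a 2-D grid of tokens by s rows and s columns."""
    h, w = len(grid), len(grid[0])
    return [[grid[(i + s) % h][(j + s) % w] for j in range(w)]
            for i in range(h)]

def partition_windows(grid, m):
    """Split an h x w grid into non-overlapping m x m windows."""
    h, w = len(grid), len(grid[0])
    windows = []
    for i in range(0, h, m):
        for j in range(0, w, m):
            windows.append([row[j:j + m] for row in grid[i:i + m]])
    return windows

grid = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 token ids
plain = partition_windows(grid, 2)                  # attention stays local
shifted = partition_windows(cyclic_shift(grid, 1), 2)  # windows now straddle
                                                       # the old borders
```

Because attention cost grows with window size rather than image size, the total cost is linear in the number of windows, i.e. in image area.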
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
A new scaling method is proposed that uniformly scales all dimensions of depth, width, and resolution using a simple yet highly effective compound coefficient; its effectiveness is demonstrated by scaling up MobileNets and ResNet.
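Compound scaling multiplies network depth, width, and input resolution by α^φ, β^φ, γ^φ for a single user-chosen coefficient φ; the base values below (α=1.2, β=1.1, γ=1.15) are the ones reported in the EfficientNet paper. A quick sketch:

```python
# EfficientNet compound scaling: one coefficient phi scales all three
# dimensions jointly. alpha/beta/gamma are the grid-searched base values
# from the paper, chosen so that alpha * beta^2 * gamma^2 is about 2
# (i.e. each increment of phi roughly doubles FLOPs).
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    return {
        "depth": alpha ** phi,       # multiplier on number of layers
        "width": beta ** phi,        # multiplier on number of channels
        "resolution": gamma ** phi,  # multiplier on input side length
    }

s = compound_scale(2)  # roughly the B2 scaling step
```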
Albumentations: fast and flexible image augmentations
Albumentations is presented: a fast and flexible open-source library for image augmentation offering a wide variety of image transform operations, as well as an easy-to-use wrapper around other augmentation libraries.
ArcFace: Additive Angular Margin Loss for Deep Face Recognition.
Extensive experiments demonstrate that ArcFace can enhance the discriminative feature embedding as well as strengthen generative face synthesis; the inverse problem of mapping feature vectors back to face images is also explored.
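The additive angular margin replaces the target-class logit cos θ with cos(θ + m) before scaling, which directly enlarges the angular gap between classes. A minimal sketch using plain math (s=64, m=0.5 are the defaults used in the ArcFace paper):

```python
import math

# ArcFace additive angular margin: for the ground-truth class the logit
# cos(theta) becomes s * cos(theta + m); other classes keep s * cos(theta).
def arcface_logit(cos_theta, margin=0.5, scale=64.0, is_target=True):
    if not is_target:
        return scale * cos_theta
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    return scale * math.cos(theta + margin)

plain = arcface_logit(0.8, is_target=False)     # ordinary scaled logit
margined = arcface_logit(0.8, is_target=True)   # strictly smaller
```

Lowering the target logit forces the network to pull same-class embeddings into a tighter angular cluster before the softmax is satisfied, which is exactly the property landmark-retrieval pipelines exploit.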
Google Landmark Recognition 2020 Competition Third Place Solution
We present our third-place solution to the Google Landmark Recognition 2020 competition: an ensemble of global-feature-only Sub-center ArcFace models, for which we introduce dynamic margins.