Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers

@article{Miech2021ThinkingFA,
  title={Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers},
  author={Antoine Miech and Jean-Baptiste Alayrac and Ivan Laptev and Josef Sivic and Andrew Zisserman},
  journal={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021},
  pages={9821-9831}
}
Our objective is language-based search of large-scale image and video datasets. For this task, the approach that consists of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive as retrieval scales and is efficient for billions of images using approximate nearest neighbour search. An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings, but is often… 
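To make the fast/slow trade-off concrete, the retrieval pattern described in the abstract can be sketched as follows: a dual encoder scores the whole gallery with dot products (the part that approximate nearest neighbour indexes accelerate), and an expensive query-conditioned scorer re-ranks only a small shortlist. This is a minimal illustrative sketch; the encoder, dimensions and the placeholder scorer are assumptions, not the paper's architecture.

import numpy as np

def shortlist(query_emb, gallery_embs, k):
    # Fast path: one dot product per gallery item; in practice this is served
    # by an approximate nearest neighbour index over precomputed embeddings.
    scores = gallery_embs @ query_emb
    return np.argsort(-scores)[:k]

def rerank(candidate_ids, cross_score):
    # Slow path: an expensive query-conditioned scorer (e.g. cross-attention),
    # applied only to the shortlist rather than the full gallery.
    return sorted(candidate_ids, key=cross_score, reverse=True)

rng = np.random.default_rng(0)
query_emb = rng.normal(size=128)
gallery_embs = rng.normal(size=(10_000, 128))
gallery_embs /= np.linalg.norm(gallery_embs, axis=1, keepdims=True)

top_ids = shortlist(query_emb, gallery_embs, k=50)
# Placeholder cross_score so the example runs end to end; a real system would
# call a vision-text transformer on each (query, candidate) pair here.
final_ranking = rerank(top_ids.tolist(),
                       cross_score=lambda i: float(gallery_embs[i] @ query_emb))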
Inflate and Shrink: Enriching and Reducing Interactions for Fast Text-Image Retrieval
TLDR
Through an inflating operation followed by a shrinking operation, both efficiency and accuracy of a late-interaction model are boosted.
Assorted Attention Network for Cross-Lingual Language-to-Vision Retrieval
TLDR
An assorted attention network (A2N) is proposed to simultaneously overcome the language gap, bridge the modal gap and fuse features of the two modalities in an elegant and effective manner for the cross-lingual language-to-vision retrieval task.
LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval
TLDR
This work lets the dual encoder provide hard negatives to the cross encoder, and uses the more discriminative cross encoder to distill its predictions back to the dual encoder, with both trained together efficiently in the same model.
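A generic version of the distillation step described above can be written as a KL divergence between the candidate distributions induced by the two encoders. The temperature, candidate set and exact loss below are assumptions for illustration, not necessarily the LoopITR objective.

import numpy as np

def softmax(x, tau=1.0):
    x = x / tau
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def distill_loss(dual_scores, cross_scores, tau=1.0):
    # KL(teacher || student): push the dual encoder's score distribution over a
    # shared candidate set towards the cross encoder's distribution.
    p_teacher = softmax(cross_scores, tau)   # cross encoder (slow, accurate)
    p_student = softmax(dual_scores, tau)    # dual encoder (fast)
    return float(np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student))))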
X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks
TLDR
This paper proposes X-DETR, whose architecture has three major components: an object detector, a language encoder, and vision-language alignment, which shows good accuracy and fast speeds for multiple instance-wise vision-language tasks, e.g., 16.4 AP on LVIS detection of 1.2K categories at ∼20 frames per second without using any LVIS annotation during training.
Image Retrieval from Contextual Descriptions
TLDR
A new multimodal challenge, Image Retrieval from Contextual Descriptions (ImageCoDe), where models are tasked with retrieving the correct image from a set of 10 minimally contrastive candidates based on a contextual description, revealing that these models dramatically lag behind human performance.
Cross Modal Retrieval with Querybank Normalisation
TLDR
This work presents a simple but effective framework called Querybank Normalisation (QB-Norm) that re-normalises query similarities to account for hubs in the embedding space, and proposes a novel similarity normalisation method, the Dynamic Inverted Softmax, which is significantly more robust than existing approaches.
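The re-normalisation idea can be sketched with a plain inverted softmax: each gallery item's similarity is divided by how strongly a bank of stored queries also scores that item, which deflates hub items. The temperature and the static variant shown here are illustrative assumptions; the Dynamic Inverted Softmax in the paper applies the correction more selectively.

import numpy as np

def inverted_softmax(sims, bank_sims, beta=20.0):
    # sims:      (num_queries, num_items) similarities for the test queries
    # bank_sims: (bank_size,   num_items) similarities for a bank of stored queries
    # Items that many bank queries score highly ("hubs") get a large denominator,
    # which suppresses their inflated similarities.
    denom = np.exp(beta * bank_sims).sum(axis=0)
    return np.exp(beta * sims) / denom[None, :]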
Local-Global Context Aware Transformer for Language-Guided Video Segmentation
TLDR
LOCATER (local-global context aware Transformer) augments the Transformer architecture with a finite memory so as to query the entire video with the language expression in an efficient manner, and outperforms previous state-of-the-art solutions.
Cascaded Fast and Slow Models for Efficient Semantic Code Search
TLDR
An efficient and accurate semantic code search framework with cascaded fast and slow models, in which a fast transformer encoder is learned to optimize a scalable index for fast retrieval, followed by a slow classification-based re-ranking model that improves the top-K results returned by the fast retrieval.
A CLIP-Hitchhiker's Guide to Long Video Retrieval
TLDR
It is found that the simple yet effective baseline of a weighted mean of frame embeddings via query-scoring is a significant improvement over all prior temporal modelling attempts and over mean-pooling.
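The query-scored weighted mean described above amounts to attending over frame embeddings with the text query: each frame is weighted by its similarity to the query before averaging. The temperature below is an illustrative assumption, not the paper's setting.

import numpy as np

def query_scored_video_embedding(query_emb, frame_embs, temperature=0.1):
    # Weight each frame by its similarity to the query, then average.
    sims = frame_embs @ query_emb                      # (num_frames,)
    w = np.exp((sims - sims.max()) / temperature)
    w /= w.sum()                                       # softmax over frames
    return w @ frame_embs                              # (dim,) query-conditioned video embedding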
Domain Adaptation in Multi-View Embedding for Cross-Modal Video Retrieval
TLDR
This paper proposes a novel iterative domain alignment method by means of pseudo-labelling target videos and cross-domain (i.e. source-target) ranking, which adapts the embedding space to the target gallery, consistently outperforming source-only as well as marginal and conditional alignment methods.
...

References

Showing 1-10 of 81 references
Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval
TLDR
A novel fine-tuning framework that turns any pretrained text-image multi-modal model into an efficient retrieval model, based on a cooperative retrieve-and-rerank approach that combines a twin-network and a cross-encoder component for a more nuanced ranking of the retrieved small set of items.
12-in-1: Multi-Task Vision and Language Representation Learning
TLDR
This work develops a large-scale, multi-task model that culminates in a single model on 12 datasets from four broad categories of task including visual question answering, caption-based image retrieval, grounding referring expressions, and multimodal verification and shows that finetuning task-specific models from this model can lead to further improvements, achieving performance at or above the state-of-the-art.
Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings
TLDR
This paper proposes to enrich the embedding by disentangling parts-of-speech (PoS) in the accompanying captions by building a separate multi-modal embedding space for each PoS tag, which enables learning specialised embedding spaces that offer multiple views of the same embedded entities.
Dual Encoding for Zero-Example Video Retrieval
TLDR
This paper takes a concept-free approach, proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own and establishes a new state-of-the-art for zero-example video retrieval.
Enhancing Video Summarization via Vision-Language Embedding
TLDR
This paper addresses video summarization, or the problem of distilling a raw video into a shorter form while still capturing the original story, by extending a recent submodular summarization approach with representativeness and interestingness objectives computed on features from a joint vision-language embedding space.
Unified Vision-Language Pre-Training for Image Captioning and VQA
TLDR
VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions and VQA 2.0.
Support-set bottlenecks for video-text representation learning
TLDR
This paper proposes a novel method that leverages a generative model to naturally push related samples together, and results in representations that explicitly encode semantics shared between samples, unlike noise contrastive learning.
Learning Two-Branch Neural Networks for Image-Text Matching Tasks
TLDR
This paper investigates two-branch neural networks for learning image-text similarity in two tasks, image-sentence matching and region-phrase matching, and proposes two network structures that produce different output representations.
Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning
TLDR
A Hierarchical Graph Reasoning (HGR) model is proposed, which decomposes video-text matching into global-to-local levels and generates hierarchical textual embeddings via attention-based graph reasoning.
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, applying successfully to English constituency parsing with both large and limited training data.
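For reference, the core operation of the Transformer cited above is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal single-head version, written here as an illustrative sketch:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # (n_q, d_v)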
...