Learning the Best Pooling Strategy for Visual Semantic Embedding

@article{Chen2021LearningTB,
  title={Learning the Best Pooling Strategy for Visual Semantic Embedding},
  author={Jiacheng Chen and Hexiang Hu and Hao Wu and Yuning Jiang and Changhu Wang},
  journal={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021},
  pages={15784-15793}
}
  • Jiacheng Chen, Hexiang Hu, C. Wang
  • Published 9 November 2020
  • Computer Science
  • 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Visual Semantic Embedding (VSE) is a dominant approach for vision-language retrieval, which aims at learning a deep embedding space such that visual data are embedded close to their semantic text labels or descriptions. Recent VSE models use complex methods to better contextualize and aggregate multi-modal features into holistic embeddings. However, we discover that surprisingly simple (but carefully selected) global pooling functions (e.g., max pooling) outperform those complex models, across… 
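A minimal sketch of the idea described in the abstract: aggregate a variable-length set of local features (image regions or caption tokens) into a single holistic embedding with a simple global pooling function such as max pooling, then compare modalities by cosine similarity. The function names, dimensions, and pooling choices below are illustrative assumptions, not the authors' released code.

```python
# Sketch (not the authors' code): aggregate local features into one holistic
# embedding with a simple global pooling function, then compare by cosine.
import torch
import torch.nn.functional as F

def pool_features(features: torch.Tensor, strategy: str = "max") -> torch.Tensor:
    """features: (batch, num_items, dim) local features; returns (batch, dim)."""
    if strategy == "max":
        pooled, _ = features.max(dim=1)      # element-wise max over the set
    elif strategy == "mean":
        pooled = features.mean(dim=1)
    else:
        raise ValueError(f"unknown pooling strategy: {strategy}")
    return F.normalize(pooled, dim=-1)       # L2-normalize for cosine similarity

# Toy usage: 2 images with 36 region features, 2 captions with 12 token features.
img = pool_features(torch.randn(2, 36, 1024), "max")
txt = pool_features(torch.randn(2, 12, 1024), "max")
similarity = img @ txt.t()                   # (2, 2) image-caption cosine scores
```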
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
TLDR
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss, and it is shown that the scale of the corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
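A hedged sketch of the dual-encoder contrastive objective this summary describes: each modality is encoded independently, and a symmetric cross-entropy (InfoNCE-style) loss over in-batch image-text pairs pulls matched pairs together. Variable names and the temperature value are illustrative, not taken from the paper's implementation.

```python
# Sketch of a symmetric contrastive (InfoNCE-style) loss over in-batch
# image-text pairs, as used by dual-encoder retrieval models.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """img_emb, txt_emb: (batch, dim) embeddings of matched image-text pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature     # (batch, batch) similarities
    targets = torch.arange(logits.size(0))           # i-th text matches i-th image
    loss_i2t = F.cross_entropy(logits, targets)      # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text-to-image direction
    return (loss_i2t + loss_t2i) / 2
```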
A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus
TLDR
The HierArchical Multi-Modal EncodeR (HAMMER) is proposed that encodes a video at both the coarse-grained clip level and the fine-grained frame level to extract information at different scales based on multiple subtasks, namely, video retrieval, segment temporal localization, and masked language modeling.
Exploring Visual Engagement Signals for Representation Learning
TLDR
This paper presents VisE, a weakly supervised learning approach, which maps social images to pseudo labels derived from clustered engagement signals, and empirically demonstrates the effectiveness of VisE across a diverse set of classification tasks beyond the scope of conventional recognition.
Super Visual Semantic Embedding for Cross-Modal Image-Text Retrieval
TLDR
A Super Visual Semantic Embedding Network (SVSEN) for cross-modal image-text retrieval is proposed, which contains two independent branch substructures including the image embedding network and the text embedding network.
LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval
TLDR
This work lets the dual encoder provide hard negatives to the cross encoder, and uses the more discriminative cross encoder to distill its predictions back to the dual encoder; both steps are performed efficiently within the same model.
Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval
This paper aims for the task of text-to-video retrieval, where given a query in the form of a natural-language sentence, it is asked to retrieve videos which are semantically relevant to the given query.
COOKIE: Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation
TLDR
The Contrastive Cross-Modal Knowledge Sharing Pretraining (COOKIE) method is proposed to learn universal text-image representations and achieves new state-of-the-art results while using only 3/1000 of the inference time compared to one-stream models.
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
TLDR
This work proposes a novel COllaborative vision-language pretraining model termed COTS for image-text retrieval by enhancing cross-modal interaction; it achieves the highest performance among all two-stream methods and comparable performance (but with 10,800× faster inference) w.r.t. the latest single-stream methods.
ECCV Caption: Correcting False Negatives by Collecting Machine-and-Human-verified Image-Caption Associations for MS-COCO
TLDR
This work constructs the Extended COCO Validation (ECCV) Caption dataset by supplying the missing associations with machine and human annotators, and re-evaluates 25 existing VL models on existing and proposed benchmarks.
VALHALLA: Visual Hallucination for Machine Translation
TLDR
A visual hallucination framework, called VALHALLA, is proposed that requires only source sentences at inference time and instead uses hallucinated visual representations for multimodal machine translation, demonstrating the effectiveness of this approach over both text-only baselines and state-of-the-art methods.
...
...

References

SHOWING 1-10 OF 59 REFERENCES
Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations
TLDR
The Unified VSE outperforms baselines on cross-modal retrieval tasks; the enforcement of the semantic coverage improves the model's robustness in defending against text-domain adversarial attacks and empowers the use of visual cues to accurately resolve word dependencies in novel sentences.
Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval
  • Yale Song, M. Soleymani
  • Computer Science
    2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2019
TLDR
Polysemous Instance Embedding Networks (PIE-Nets) are introduced that compute multiple and diverse representations of an instance by combining global context with locally-guided features via multi-head self-attention and residual learning.
DeViSE: A Deep Visual-Semantic Embedding Model
TLDR
This paper presents a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text and shows that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training.
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
TLDR
This work introduces the structure-content neural language model that disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder, and shows that with linear encoders, the learned embedding space captures multimodal regularities in terms of vector space arithmetic.
Zero-Shot Learning by Convex Combination of Semantic Embeddings
TLDR
A simple method is proposed for constructing an image embedding system from any existing image classifier and a semantic word embedding model, which contains the $n$ class labels in its vocabulary; it outperforms state-of-the-art methods on the ImageNet zero-shot learning task.
VSE++: Improved Visual-Semantic Embeddings
TLDR
This paper introduces a very simple change to the loss function used in the original formulation by Kiros et al. (2014), which leads to drastic improvements in the retrieval performance, and shows that similar improvements also apply to the Order-embeddings by Vendrov et al.
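For context, the loss-function change VSE++ is known for is replacing the sum over negatives in the hinge-based triplet loss with the single hardest in-batch negative; below is a minimal sketch under that assumption, with the margin value and variable names chosen for illustration rather than taken from the authors' code.

```python
# Sketch of a hardest-negative (max-of-hinges) triplet loss for image-text
# retrieval; margin and names are illustrative, not the authors' implementation.
import torch

def hardest_negative_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                          margin: float = 0.2) -> torch.Tensor:
    """img_emb, txt_emb: (batch, dim) L2-normalized embeddings of matched pairs."""
    scores = img_emb @ txt_emb.t()            # (batch, batch) cosine similarities
    pos = scores.diag().view(-1, 1)           # similarity of each matched pair
    # Hinge cost of every negative caption (per image) and negative image (per caption).
    cost_txt = (margin + scores - pos).clamp(min=0)
    cost_img = (margin + scores - pos.t()).clamp(min=0)
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_txt = cost_txt.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)
    # Keep only the hardest negative instead of summing over all negatives.
    return cost_txt.max(dim=1)[0].mean() + cost_img.max(dim=0)[0].mean()
```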
In Defense of Grid Features for Visual Question Answering
TLDR
This paper revisits grid features for VQA, and finds they can work surprisingly well -- running more than an order of magnitude faster with the same accuracy (e.g. if pre-trained in a similar fashion).
Language-Agnostic Visual-Semantic Embeddings
TLDR
A novel character-based word-embedding approach, allowing the model to project similar words across languages into the same word-embedding space, and a novel cross-language alignment module that not only makes the architecture language-invariant, but also presents better predictive performance.
IMRAM: Iterative Matching With Recurrent Attention Memory for Cross-Modal Image-Text Retrieval
TLDR
This paper proposes an Iterative Matching with Recurrent Attention Memory method, in which correspondences between images and texts are captured with multiple steps of alignments, and introduces an iterative matching scheme to explore such fine-grained correspondence progressively.
Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning
TLDR
A Hierarchical Graph Reasoning (HGR) model is proposed, which decomposes video-text matching into global-to-local levels and generates hierarchical textual embeddings via attention-based graph reasoning.
...
...