Learning the Best Pooling Strategy for Visual Semantic Embedding
@inproceedings{Chen2021LearningTB,
  title     = {Learning the Best Pooling Strategy for Visual Semantic Embedding},
  author    = {Jiacheng Chen and Hexiang Hu and Hao Wu and Yuning Jiang and Changhu Wang},
  booktitle = {2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2021},
  pages     = {15784-15793}
}
Visual Semantic Embedding (VSE) is a dominant approach for vision-language retrieval; it learns a deep embedding space in which visual data lie close to their semantic text labels or descriptions. Recent VSE models use complex methods to better contextualize and aggregate multi-modal features into holistic embeddings. However, we discover that surprisingly simple (but carefully selected) global pooling functions (e.g., max pooling) outperform those complex models, across…
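The paper's central observation, that a carefully chosen global pooling function can replace elaborate aggregation modules, is easy to make concrete. The following is a minimal sketch (not the paper's actual GPO implementation; the feature shapes and the k-max variant shown here are illustrative assumptions) of pooling a set of local features into a single holistic embedding:

```python
import torch

def pool_features(feats: torch.Tensor, strategy: str = "max") -> torch.Tensor:
    """Aggregate a set of local features into one holistic embedding.

    feats: (batch, n_items, dim), e.g. region features from an image
           detector or token features from a text encoder.
    Returns: (batch, dim) embedding, L2-normalized for cosine retrieval.
    """
    if strategy == "mean":
        emb = feats.mean(dim=1)
    elif strategy == "max":                      # the simple, strong baseline
        emb, _ = feats.max(dim=1)
    elif strategy == "k-max":                    # average of the top-k activations
        k = min(4, feats.size(1))
        emb = feats.topk(k, dim=1).values.mean(dim=1)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return torch.nn.functional.normalize(emb, dim=-1)

# Example: 36 detected regions, 1024-d features, batch of 2.
regions = torch.randn(2, 36, 1024)
print(pool_features(regions, "max").shape)  # torch.Size([2, 1024])
```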
21 Citations
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
- Computer Science, ICML
- 2021
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss, and it is shown that the scale of the corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
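The dual-encoder recipe summarized here is straightforward to sketch: matched image-text pairs are pulled together while mismatched pairs from the same batch act as negatives. A minimal, illustrative version of a symmetric in-batch contrastive loss follows (the temperature value and the assumption of pre-normalized embeddings are ours, not details taken from the paper):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric in-batch contrastive loss over N matched pairs.

    img_emb, txt_emb: (N, dim), assumed L2-normalized; row i of each
    is a matched image-text pair, all other rows act as negatives.
    """
    logits = img_emb @ txt_emb.t() / temperature       # (N, N) similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```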
A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus
- Computer Science, arXiv
- 2020
The HierArchical Multi-Modal EncodeR (HAMMER) is proposed that encodes a video at both the coarse-grained clip level and the fine-grained frame level to extract information at different scales based on multiple subtasks, namely, video retrieval, segment temporal localization, and masked language modeling.
Exploring Visual Engagement Signals for Representation Learning
- Computer Science, 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
This paper presents VisE, a weakly supervised learning approach that maps social images to pseudo labels derived from clustered engagement signals, and empirically demonstrates the effectiveness of VisE across a diverse set of classification tasks beyond the scope of conventional recognition.
Super Visual Semantic Embedding for Cross-Modal Image-Text Retrieval
- Computer Science, CSAE
- 2021
A Super Visual Semantic Embedding Network (SVSEN) for cross-modal image-text retrieval is proposed, which contains two independent branches: an image embedding network and a text embedding network.
LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval
- Computer Science, arXiv
- 2022
This work lets the dual encoder provide hard negatives to the cross encoder and uses the more discriminative cross encoder to distill its predictions back to the dual encoder, with both trained together efficiently in the same model.
Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval
- Computer Science, IEEE Transactions on Circuits and Systems for Video Technology
- 2022
This paper targets the task of text-to-video retrieval: given a query in the form of a natural-language sentence, the goal is to retrieve videos that are semantically relevant to the given…
COOKIE: Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation
- Computer Science, 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
The Contrastive Cross-Modal Knowledge Sharing Pretraining (COOKIE) method learns universal text-image representations and achieves new state-of-the-art results while requiring only 3/1000 of the inference time of one-stream models.
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
- Computer Science, arXiv
- 2022
This work proposes a novel COllaborative Two-Stream vision-language pretraining model, termed COTS, for image-text retrieval that enhances cross-modal interaction; it achieves the highest performance among two-stream methods and performance comparable (but with 10,800× faster inference) to the latest single-stream methods.
ECCV Caption: Correcting False Negatives by Collecting Machine-and-Human-verified Image-Caption Associations for MS-COCO
- Computer Science, arXiv
- 2022
This work constructs the Extended COCO Validation (ECCV) Caption dataset by supplying the missing associations with machine and human annotators, and re-evaluates 25 existing VL models on existing and proposed benchmarks.
VALHALLA: Visual Hallucination for Machine Translation
- Computer Science, arXiv
- 2022
A visual hallucination framework, called VALHALLA, is proposed that requires only source sentences at inference time, using hallucinated visual representations for multimodal machine translation; it demonstrates the effectiveness of this approach over both text-only baselines and state-of-the-art methods.
References (showing 1-10 of 59)
Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations
- Computer Science, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019
The Unified VSE outperforms baselines on cross-modal retrieval tasks; enforcing semantic coverage improves the model's robustness against text-domain adversarial attacks and enables visual cues to accurately resolve word dependencies in novel sentences.
Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval
- Computer Science, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2019
Polysemous Instance Embedding Networks (PIE-Nets) are introduced that compute multiple and diverse representations of an instance by combining global context with locally-guided features via multi-head self-attention and residual learning.
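For contrast with the simple pooling functions sketched earlier, attention-based aggregation of this kind looks roughly like the following sketch (the shapes, head count, and residual scheme are assumptions for illustration, not the exact PIE-Net architecture):

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Illustrative attention-based aggregation: combine a global context
    vector with locally attended features via a residual connection."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # dim must be divisible by num_heads
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats):                          # feats: (batch, n_items, dim)
        global_ctx = feats.mean(dim=1, keepdim=True)   # (batch, 1, dim) global context
        local, _ = self.attn(global_ctx, feats, feats) # attend over local features
        return (global_ctx + local).squeeze(1)         # residual combination -> (batch, dim)

# Example: pool 36 region features of dimension 1024.
print(AttentionPooling(1024)(torch.randn(2, 36, 1024)).shape)  # torch.Size([2, 1024])
```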
DeViSE: A Deep Visual-Semantic Embedding Model
- Computer Science, NIPS
- 2013
This paper presents a new deep visual-semantic embedding model trained to identify visual objects using labeled image data as well as semantic information gleaned from unannotated text, and shows that this semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training.
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
- Computer Science, arXiv
- 2014
This work introduces the structure-content neural language model that disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder, and shows that with linear encoders, the learned embedding space captures multimodal regularities in terms of vector space arithmetic.
Zero-Shot Learning by Convex Combination of Semantic Embeddings
- Computer Science, ICLR
- 2014
A simple method is proposed for constructing an image embedding system from any existing image classifier and a semantic word embedding model whose vocabulary contains the $n$ class labels; it outperforms state-of-the-art methods on the ImageNet zero-shot learning task.
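This construction (known as ConSE) is simple enough to sketch: take the classifier's top class probabilities and form the image embedding as the probability-weighted convex combination of the corresponding label word embeddings. A minimal illustration, with shapes and the top-k value assumed:

```python
import torch

def convex_combination_embedding(probs, label_embs, top_k=10):
    """ConSE-style embedding: convex combination of label embeddings.

    probs:      (n_classes,) softmax output of an existing classifier.
    label_embs: (n_classes, dim) word embeddings of the class labels.
    """
    top = probs.topk(top_k)
    weights = top.values / top.values.sum()      # renormalize: convex weights
    return weights @ label_embs[top.indices]     # (dim,) image embedding
```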
VSE++: Improved Visual-Semantic Embeddings
- Computer Science, arXiv
- 2017
This paper introduces a very simple change to the loss function used in the original formulation by Kiros et al. (2014), which leads to drastic improvements in retrieval performance, and shows that similar improvements also apply to the Order-Embeddings of Vendrov et al.
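The "very simple change" is replacing the sum over negatives in the triplet ranking loss with the single hardest in-batch negative. A sketch of this max-hinge loss (the margin value and the cosine-similarity input are assumptions):

```python
import torch

def max_hinge_loss(sim, margin=0.2):
    """VSE++-style triplet loss using the hardest in-batch negative.

    sim: (N, N) similarity matrix; sim[i, i] is the matched pair score.
    """
    pos = sim.diag().view(-1, 1)                      # (N, 1) positive scores
    cost_txt = (margin + sim - pos).clamp(min=0)      # caption negatives per image
    cost_img = (margin + sim - pos.t()).clamp(min=0)  # image negatives per caption
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_txt = cost_txt.masked_fill(mask, 0)          # ignore the positive pairs
    cost_img = cost_img.masked_fill(mask, 0)
    # Hardest negative only (the VSE++ change), instead of summing over all.
    return cost_txt.max(dim=1).values.mean() + cost_img.max(dim=0).values.mean()
```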
In Defense of Grid Features for Visual Question Answering
- Computer Science, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
This paper revisits grid features for VQA and finds they can work surprisingly well, running more than an order of magnitude faster with the same accuracy (e.g., when pre-trained in a similar fashion).
Language-Agnostic Visual-Semantic Embeddings
- Computer Science, 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
A novel character-based word-embedding approach is proposed, allowing the model to project similar words across languages into the same word-embedding space, along with a novel cross-language alignment module that not only makes the architecture language-invariant but also yields better predictive performance.
IMRAM: Iterative Matching With Recurrent Attention Memory for Cross-Modal Image-Text Retrieval
- Computer Science, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
This paper proposes an Iterative Matching with Recurrent Attention Memory method, in which correspondences between images and texts are captured with multiple steps of alignment, and introduces an iterative matching scheme to explore such fine-grained correspondence progressively.
Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning
- Computer Science, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
A Hierarchical Graph Reasoning (HGR) model is proposed, which decomposes video-text matching into global-to-local levels and generates hierarchical textual embeddings via attention-based graph reasoning.