Corpus ID: 220280312

Modality-Agnostic Attention Fusion for visual search with text feedback

@article{Dodds2020ModalityAgnosticAF,
  title={Modality-Agnostic Attention Fusion for visual search with text feedback},
  author={Eric Dodds and Jack Culpepper and Simao Herdade and Yang Zhang and Kofi Boakye},
  journal={ArXiv},
  year={2020},
  volume={abs/2007.00145}
}
Image retrieval with natural language feedback offers the promise of catalog search based on fine-grained visual features that go beyond objects and binary attributes, facilitating real-world applications such as e-commerce. Our Modality-Agnostic Attention Fusion (MAAF) model combines image and text features and outperforms existing approaches on two datasets for visual search with a modifying phrase, Fashion IQ and CSS, and performs competitively on a dataset with only single-word modifications…
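As a rough illustration of the fusion idea described in the abstract, the sketch below concatenates image-region tokens and word tokens into a single sequence, lets a shared transformer attend across both modalities (hence "modality-agnostic"), and pools the result into a query embedding for retrieval. The module names, dimensions, and mean pooling are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ModalityAgnosticFusion(nn.Module):
    """Illustrative sketch: fuse image-region tokens and word tokens with a
    shared transformer whose attention does not distinguish modalities."""

    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)

    def forward(self, image_tokens, text_tokens):
        # image_tokens: (B, R, dim) region features from an image backbone
        # text_tokens:  (B, T, dim) word embeddings of the modifying phrase
        tokens = torch.cat([image_tokens, text_tokens], dim=1)  # (B, R+T, dim)
        fused = self.encoder(tokens)   # attention spans both modalities
        return fused.mean(dim=1)       # (B, dim) query embedding

# The fused query embedding would then be compared (e.g. by cosine
# similarity) against candidate catalog image embeddings for ranking.
```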
Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models
TLDR
CIRPLANT is proposed, a transformer-based model that leverages rich pre-trained vision-and-language (V&L) knowledge for modifying visual features conditioned on natural language; it outperforms existing methods on open-domain images while matching state-of-the-art accuracy on existing narrow datasets such as fashion.
Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
  • Mingchen Zhuge, D. Gao, +5 authors L. Shao
  • Computer Science
  • 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2021
TLDR
A new vision-language (VL) pre-training model dubbed Kaleido-BERT is presented, which introduces a novel kaleido strategy for fashion cross-modality representations from transformers and designs alignment-guided masking to jointly focus more on image-text semantic relations.
RTIC: Residual Learning for Text and Image Composition using Graph Convolutional Network
TLDR
A novel method is introduced that combines a graph convolutional network (GCN) with existing composition methods through a simple new architecture using skip connections, which can effectively encode the errors between the source and target images in the latent space.
Heterogeneous Feature Fusion and Cross-modal Alignment for Composed Image Retrieval
TLDR
An end-to-end framework for composed image retrieval is presented, consisting of three key components: Multi-modal Complementary Fusion (MCF), Cross-modal Guided Pooling (CGP), and Relative Caption-aware Consistency (RCC); it achieves outstanding performance against state-of-the-art methods.
Image Change Captioning by Learning from an Auxiliary Task
  • M. Hosseinzadeh, Yang Wang
  • 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2021
We tackle the challenging task of image change captioning. The goal is to describe the subtle difference between two very similar images by generating a sentence caption. While the recent methods…
M5Product: A Multi-modal Pretraining Benchmark for E-commercial Product Downstream Tasks
  • Xiao Dong, Xunlin Zhan, +4 authors Xiaodan Liang
  • Computer Science
  • ArXiv
  • 2021
TLDR
A baseline model, M5-MMT, is provided that makes the first attempt to integrate different modality configurations into a unified model for feature fusion, addressing the great challenge of semantic alignment.
Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback
TLDR
This paper introduces the Fashion IQ dataset, the first fashion dataset to provide human-generated captions that distinguish similar pairs of garment images together with side information consisting of real-world product descriptions and derived visual attribute labels for these images, and provides a detailed analysis of the dataset's characteristics.

References

SHOWING 1-10 OF 58 REFERENCES
Image Search With Text Feedback by Visiolinguistic Attention Learning
TLDR
This work proposes a composite transformer that can be seamlessly plugged into a CNN to selectively preserve and transform the visual features conditioned on language semantics, thus yielding an expressive representation for effective image search.
The Fashion IQ Dataset: Retrieving Images by Combining Side Information and Relative Natural Language Feedback
TLDR
A novel approach is proposed, and it is empirically demonstrated that combining natural language feedback with visual attribute information results in superior user feedback modeling and retrieval performance relative to using either of these modalities alone.
Fusion of Detected Objects in Text for Visual Question Answering
TLDR
A detailed ablation analysis shows that the early integration of the visual features into the text analysis is key to the effectiveness of the new architecture.
Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning
TLDR
This paper proposes a novel adaptive attention model with a visual sentinel that sets the new state-of-the-art by a significant margin on image captioning.
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
TLDR
This work extensively evaluates Multimodal Compact Bilinear pooling (MCB) on the visual question answering and grounding tasks and consistently shows the benefit of MCB over ablations without MCB.
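For context, compact bilinear pooling fuses two feature vectors by approximating their outer product with the circular convolution of their count sketches, computed efficiently via the FFT. The sketch below is a minimal NumPy illustration of that mechanism under assumed settings (the output dimension d and the hashing setup are illustrative), not the paper's implementation.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project x (dim n) to dim d using a fixed index hash h and sign hash s."""
    out = np.zeros(d)
    np.add.at(out, h, s * x)  # scatter-add signed entries into hashed bins
    return out

def mcb(x, y, d=16000, seed=0):
    """Compact bilinear pooling: outer product of x and y approximated by
    circular convolution of their count sketches, done in the FFT domain."""
    rng = np.random.default_rng(seed)  # fixed seed keeps hashes consistent
    hx, hy = rng.integers(0, d, x.size), rng.integers(0, d, y.size)
    sx, sy = rng.choice([-1, 1], x.size), rng.choice([-1, 1], y.size)
    fx = np.fft.rfft(count_sketch(x, hx, sx, d))
    fy = np.fft.rfft(count_sketch(y, hy, sy, d))
    return np.fft.irfft(fx * fy, n=d)  # fused d-dimensional feature
```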
UNITER: Learning UNiversal Image-TExt Representations
TLDR
UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets is introduced, which can power heterogeneous downstream V+L tasks with joint multimodal embeddings.
FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval
TLDR
Fashion matching requires paying much more attention to fine-grained information in fashion images and texts, so FashionBERT, which leverages image patches as features, is proposed to learn high-level representations of texts and images.
Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations
TLDR
The Mutual Iterative Attention module is built, which integrates correlated visual features and textual concepts by aligning the two modalities, and experiments demonstrate that the proposed approach is effective and generalizes well to a wide range of models for image-related applications.
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
TLDR
A combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions is proposed, demonstrating the broad applicability of this approach to VQA.
Dialog-based Interactive Image Retrieval
TLDR
A new approach to interactive image search is introduced that enables users to provide feedback via natural language, allowing for more natural and effective interaction, and it achieves better accuracy than other supervised and reinforcement learning baselines.