• Corpus ID: 244709355

VL-LTR: Learning Class-wise Visual-Linguistic Representation for Long-Tailed Visual Recognition

@inproceedings{tian2022vlltr,
  title={VL-LTR: Learning Class-wise Visual-Linguistic Representation for Long-Tailed Visual Recognition},
  author={Changyao Tian and Wenhai Wang and Xizhou Zhu and Xiaogang Wang and Jifeng Dai and Yu Qiao},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2022}
}
Deep learning-based models encounter challenges when processing long-tailed data in the real world. Existing solutions usually employ balancing strategies or transfer learning to deal with the class imbalance problem, based on the image modality alone. In this work, we present a visual-linguistic long-tailed recognition framework, termed VL-LTR, and conduct empirical studies on the benefits of introducing the text modality for long-tailed recognition (LTR). Compared to existing approaches, the…


Decoupling Representation and Classifier for Long-Tailed Recognition
It is shown that it is possible to outperform carefully designed losses, sampling strategies, even complex modules with memory, by using a straightforward approach that decouples representation and classification.
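The decoupled recipe above can be sketched in a few lines: stage one learns the representation with instance-balanced sampling (each image equally likely), stage two re-trains only the classifier under class-balanced sampling (each class equally likely). The class counts below are illustrative, not taken from the paper:

```python
import numpy as np

# Hypothetical long-tailed class counts (illustrative values only).
counts = np.array([1000, 100, 10])

# Stage 1 (representation learning): instance-balanced sampling.
# Each image is drawn with equal probability, so head classes dominate.
p_instance = counts / counts.sum()

# Stage 2 (classifier re-training): class-balanced sampling.
# Each class is drawn with equal probability regardless of its size.
p_class = np.full(len(counts), 1.0 / len(counts))
```

In stage two the backbone is frozen and only the linear classifier is updated under `p_class`, which is what lets this simple recipe outperform more elaborate loss designs.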
Self Supervision to Distillation for Long-Tailed Visual Recognition
It is shown that soft labels can serve as a powerful solution to incorporate label correlation into a multi-stage training scheme for long-tailed recognition, and a new distillation label generation module guided by self-supervision is proposed.
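A minimal sketch of the soft-label idea, assuming the standard knowledge-distillation form: the student is trained against the teacher's temperature-softened distribution, which carries inter-class correlation that hard labels discard. The temperature value is illustrative:

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with temperature T."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def soft_label_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between the teacher's softened distribution (the
    soft label) and the student's softened prediction."""
    p = softmax(teacher_logits, T)  # soft label: encodes class correlation
    q = softmax(student_logits, T)
    return float(-(p * np.log(q + 1e-12)).sum())
```

The loss is minimized when the student's distribution matches the teacher's, so correlated classes receive correlated supervision.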
Learning Transferable Visual Models From Natural Language Supervision
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
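The pre-training task described above is a symmetric contrastive (InfoNCE) objective over a batch of paired image and text embeddings: matching pairs sit on the diagonal of the similarity matrix. A minimal NumPy sketch, with the temperature value chosen for illustration:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (image, text) embeddings."""
    # L2-normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # pairwise similarity matrix
    n = len(logits)                      # matching pairs lie on the diagonal

    def xent_diag(l):
        # Cross-entropy where the correct "class" for row i is column i.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # Average the image-to-text and text-to-image directions.
    return (xent_diag(logits) + xent_diag(logits.T)) / 2
```

Correctly aligned pairs yield a lower loss than shuffled ones, which is what drives the representation learning.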
BBN: Bilateral-Branch Network With Cumulative Learning for Long-Tailed Visual Recognition
A unified Bilateral-Branch Network (BBN) is proposed to take care of both representation learning and classifier learning simultaneously, where each branch performs its own duty separately.
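The cumulative learning schedule in BBN mixes the two branches with a weight that decays over training, shifting emphasis from the conventional branch (representation learning) to the re-balancing branch (classifier learning). A minimal sketch assuming the parabolic decay alpha = 1 - (T/T_max)^2 described in the paper:

```python
import numpy as np

def bbn_alpha(epoch, max_epoch):
    """Cumulative-learning weight: starts at 1 (conventional branch
    dominates) and decays to 0 (re-balancing branch dominates)."""
    return 1.0 - (epoch / max_epoch) ** 2

def bbn_logits(conv_logits, rebal_logits, epoch, max_epoch):
    """Mix the two branches' outputs with the current alpha."""
    a = bbn_alpha(epoch, max_epoch)
    return a * np.asarray(conv_logits) + (1 - a) * np.asarray(rebal_logits)
```

Early epochs thus behave like plain instance-balanced training, and late epochs like class-balanced classifier training, without a hard stage boundary.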
Inflated Episodic Memory With Region Self-Attention for Long-Tailed Visual Recognition
  • Linchao Zhu and Yi Yang, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
An Inflated Episodic Memory (IEM) is introduced for long-tailed visual recognition to deal with the class imbalance problem, together with a novel region self-attention mechanism for multi-scale spatial feature map encoding.
Distributional Robustness Loss for Long-tail Learning
This work proposes a new loss based on robustness theory, which encourages the model to learn high-quality representations for both head and tail classes, and finds that training with this robustness loss increases the recognition accuracy of tail classes while largely maintaining the accuracy of head classes.
Rethinking Class-Balanced Methods for Long-Tailed Visual Recognition From a Domain Adaptation Perspective
This work connects existing class-balanced methods for long-tailed classification to target shift, revealing that these methods implicitly assume the training data and test data share the same class-conditioned distribution, an assumption that does not hold in general and especially not for the tail classes.
VinVL: Revisiting Visual Representations in Vision-Language Models
This paper develops an improved object detection model to provide object-centric representations of images, feeds the resulting visual features into the Transformer-based vision-language fusion model OSCAR, and uses an improved approach, OSCAR+, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.
Equalization Loss for Long-Tailed Object Recognition
This work proposes a simple but effective loss, named equalization loss, to tackle the long-tailed distribution problem by simply ignoring the discouraging gradients for rare categories, and won 1st place in the LVIS Challenge 2019.
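A minimal sketch of the gradient-ignoring idea, assuming per-class sigmoid cross-entropy: negative-label terms for rare classes are given zero weight, so frequent classes cannot suppress rare-class scores. The frequency threshold and class frequencies below are illustrative, and the paper's full weight term (which also excludes background proposals) is simplified here:

```python
import numpy as np

def equalization_loss(logits, targets, class_freq, tail_thresh=0.01):
    """Sigmoid cross-entropy that ignores negative gradients for rare
    classes (those with frequency below tail_thresh)."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    y = np.asarray(targets, dtype=float)
    rare = (np.asarray(class_freq) < tail_thresh).astype(float)
    # Weight is 0 only when class j is rare AND j is a negative label here;
    # positive samples of rare classes keep their full gradient.
    w = 1.0 - rare * (1.0 - y)
    ce = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1.0 - p + 1e-12))
    return float((w * ce).sum())
```

Masking the negative term removes the "discouraging" gradient that abundant negatives would otherwise push onto rare categories.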
ResLT: Residual Learning for Long-tailed Recognition
This work designs an effective residual fusion mechanism: one main branch is optimized to recognize images from all classes, while two additional residual branches are gradually fused and optimized to enhance recognition of medium-plus-tail classes and tail classes, respectively.