• Corpus ID: 244130382

INTERN: A New Learning Paradigm Towards General Vision

Jing Shao, Siyu Chen, Yangguang Li, Kun Wang, Zhen-fei Yin, Yinan He, Jianing Teng, Qinghong Sun, Mengya Gao, Jihao Liu, Gengshi Huang, Guanglu Song, Yichao Wu, Yuming Huang, Fenggang Liu, Huan Peng, Shuo Qin, Chengyu Wang, Yujie Wang, Conghui He, Ding Liang, Yu Liu, Fengwei Yu, Junjie Yan, Dahua Lin, Xiaogang Wang, Y. Qiao
Enormous waves of technological innovations over the past several years, marked by the advances in AI technologies, are profoundly reshaping the industry and the society. However, down the road, a key challenge awaits us, that is, our capability of meeting rapidly-growing scenario-specific demands is severely limited by the cost of acquiring the commensurate amount of training data. This difficult situation is in essence due to limitations of the mainstream learning paradigm: we need to train a… 

Benchmarking Omni-Vision Representation through the Lens of Visual Realms

This paper proposes the Omni-Realm Benchmark and a new supervised contrastive learning framework, Relational Contrastive learning (ReCo), for better omni-vision representation; it illustrates the superiority of ReCo over other supervised contrastive learning methods and reveals multiple practical observations to facilitate future research.

UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers

Experiments on multiple generative and discriminative vision-language tasks, including Visual Reasoning, Image Caption, Visual Question Answer, Image-Text Retrieval, Text-Image Retrieval, and Image Classification, demonstrate the effectiveness and versatility of the proposed UPop framework.

Million-scale Object Detection with Large Vision Model


SEPT: Towards Scalable and Efficient Visual Pre-Training

Results on various downstream tasks demonstrate that SEPT can achieve competitive or even better performance compared with ImageNet pre-training while reducing the size of training samples by one order of magnitude without resorting to any extra annotations.

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.

Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation

This work presents a new comprehensive benchmark, General-purpose Visual Understanding Evaluation (G-VUE), covering the full spectrum of visual cognitive abilities with four functional domains (Perceive, Ground, Reason, and Act), and observes that Transformer-based visual backbones generally outperform CNN-based backbones on G-VUE.

Towards Grand Unification of Object Tracking

For the first time, the great unification of the tracking network architecture and learning paradigm is accomplished, with Unicorn, a unified method that can simultaneously solve four tracking problems (SOT, MOT, VOS, MOTS) with a single network using the same model parameters.

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

This work proposes CLIP-benchmark, a first attempt to evaluate, analyze, and benchmark CLIP and its variants, and conducts a comprehensive analysis of three key factors: data, supervision, and model architecture.

Backbone is All Your Need: A Simplified Architecture for Visual Object Tracking

This paper presents a Simplified Tracking architecture (SimTrack) by leveraging a transformer backbone for joint feature extraction and interaction, and proposes a foveal window strategy, providing more diverse input patches with acceptable computational costs.



Taming Transformers for High-Resolution Image Synthesis

It is demonstrated how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images.

Objects365: A Large-Scale, High-Quality Dataset for Object Detection

  • Shuai Shao, Zeming Li, Jian Sun
  • Computer Science
    2019 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2019
Objects365 can serve as a better feature learning dataset for localization-sensitive tasks like object detection and semantic segmentation, and its better generalization ability has been verified on CityPersons, VOC segmentation, and ADE tasks.

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

A new scaling method is proposed that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient, and its effectiveness is demonstrated on scaling up MobileNets and ResNet.

Decoupled Weight Decay Regularization

This work proposes a simple modification to recover the original formulation of weight decay regularization by decoupling the weight decay from the optimization steps taken w.r.t. the loss function, and provides empirical evidence that this modification substantially improves Adam's generalization performance.

Scene Parsing through ADE20K Dataset

The ADE20K dataset, spanning diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts, is introduced and it is shown that the trained scene parsing networks can lead to applications such as image content removal and scene synthesis.

WIDER FACE: A Face Detection Benchmark

There is a gap between current face detection performance and real-world requirements; the WIDER FACE dataset is introduced, which is 10 times larger than existing datasets and contains rich annotations, including occlusions, poses, event categories, and face bounding boxes.

Are we ready for autonomous driving? The KITTI vision benchmark suite

The autonomous driving platform is used to develop novel challenging benchmarks for the tasks of stereo, optical flow, visual odometry/SLAM and 3D object detection, revealing that methods ranking high on established datasets such as Middlebury perform below average when being moved outside the laboratory to the real world.

Learning Multiple Layers of Features from Tiny Images

It is shown how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex, using a novel parallelization algorithm to distribute the work among multiple machines connected on a network.

ImageNet: A large-scale hierarchical image database

A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.

Aligning Pretraining for Detection via Object-Level Contrastive Learning

This paper achieves state-of-the-art results for transfer performance on COCO detection using a Mask R-CNN framework and advocates a design principle which encourages alignment between the self-supervised pretext task and the downstream task.