• Corpus ID: 236034208

Representation Consolidation for Training Expert Students

Zhizhong Li, Avinash Ravichandran, Charless C. Fowlkes, Marzia Polito, Rahul Bhotika, Stefano Soatto

Traditionally, distillation has been used to train a student model to emulate the input/output functionality of a teacher. A more useful goal than emulation, yet under-explored, is for the student to learn feature representations that transfer well to future tasks. However, we observe that standard distillation of task-specific teachers actually reduces the transferability of student representations to downstream tasks. We show that a multi-head, multi-task distillation method using an… 

Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability

This work investigates an alternative strategy for pre-training, namely Knowledge Distillation as Efficient Pre-training (KDEP), aiming to efficiently transfer the learned feature representation from existing pre-trained models to new student models for future downstream tasks.

X-Learner: Learning Cross Sources and Tasks for Universal Visual Representation

It is demonstrated that jointly learning from heterogeneous tasks and multiple data sources contributes to universal visual representation, leading to better transfer results on various downstream tasks, including classification, object detection, and semantic segmentation.

INTERN: A New Learning Paradigm Towards General Vision

A new learning paradigm named INTERN is developed, introducing a new data system, a new architecture, and a new benchmark that together form a general vision ecosystem to support its future development in an open and inclusive manner.



Contrastive Representation Distillation

The resulting new objective outperforms knowledge distillation and other cutting-edge distillers on a variety of knowledge transfer tasks, including single model compression, ensemble distillation, and cross-modal transfer.

Scalable Transfer Learning with Expert Models

This work trains a diverse set of experts by exploiting existing label structures, and uses cheap-to-compute performance proxies to select the relevant expert for each target task, and provides an adapter-based architecture able to compress many experts into a single model.

Knowledge Distillation from Internal Representations

This paper proposes to distill the internal representations of a large model such as BERT into a simplified version of it, and formulates two ways to distill such representations and various algorithms to conduct the distillation.

Feature-Level Ensemble Knowledge Distillation for Aggregating Knowledge from Multiple Networks

A versatile and powerful training algorithm named FEature-level Ensemble knowledge Distillation (FEED) is proposed, which aims to transfer ensemble knowledge using multiple teacher networks; a couple of training algorithms are introduced that transfer ensemble knowledge to the student at the feature-map level.

Knowledge Flow: Improve Upon Your Teachers

This paper develops knowledge flow which moves 'knowledge' from multiple deep nets, referred to as teachers, to a new deep net model, called the student, and demonstrates the approach on a variety of supervised and reinforcement learning tasks, outperforming fine-tuning and other 'knowledge exchange' methods.

Zero-Shot Knowledge Distillation in Deep Networks

This paper synthesizes Data Impressions from the complex Teacher model and utilizes them as surrogates for the original training data samples to transfer its learning to the Student via knowledge distillation, and shows that this framework achieves generalization performance competitive with distillation using the actual training data samples on multiple benchmark datasets.

Large-Scale Generative Data-Free Distillation

This work proposes a new method to train a generative image model by leveraging the intrinsic normalization layers' statistics of the trained teacher network, which enables an ensemble of generators without training data that can efficiently produce substitute inputs for subsequent distillation.

GDumb: A Simple Approach that Questions Our Progress in Continual Learning

We discuss a general formulation for the Continual Learning (CL) problem for classification—a learning task where a stream provides samples to a learner and the goal of the learner, depending on the

Transfer Learning by Adaptive Merging of Multiple Models

The proposed T-IMM (Transfer Incremental Mode Matching) is a method to leverage several pre-trained models, which extends the concept of Incremental Mode Matching from lifelong learning to the transfer learning setting and introduces layer-wise mixing ratios, which are learned automatically and fuse multiple pre-trained models before fine-tuning on the new task.

Zero-shot Knowledge Transfer via Adversarial Belief Matching

A novel method which trains a student to match the predictions of its teacher without using any data or metadata is proposed, and a metric is proposed to quantify the degree of belief matching between teacher and student in the vicinity of decision boundaries.