Train Flat, Then Compress: Sharpness-Aware Minimization Learns More Compressible Models

Clara Na, Sanket Vaibhav Mehta, Emma Strubell
Model compression by way of parameter pruning, quantization, or distillation has recently gained popularity as an approach for reducing the computational requirements of modern deep neural network models for NLP. Pruning unnecessary parameters has emerged as a simple and effective method for compressing large models that is compatible with a wide variety of contemporary off-the-shelf hardware (unlike quantization), and that requires little additional training (unlike distillation). Pruning… 
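As an illustration of the pruning approach the abstract describes, here is a minimal numpy sketch of global magnitude pruning, which zeroes out the fraction of weights with the smallest absolute values. This is a generic sketch of the technique, not the paper's exact recipe; `magnitude_prune` is a hypothetical helper name.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the `sparsity` fraction of weights with smallest magnitude."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    mask = np.abs(weights) > threshold            # keep only larger weights
    return weights * mask
```

In practice this mask is applied iteratively during or after training, with fine-tuning between pruning rounds to recover accuracy.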


Model Generalization: A Sharpness Aware Optimization Perspective

Three experiments show that sharpness-aware optimization techniques can help models achieve strong generalization, and that ASAM in particular improves generalization performance on un-normalized data.

A Fair Comparison of Two Popular Flat Minima Optimizers: Stochastic Weight Averaging vs. Sharpness-Aware Minimization

A broad benchmark across computer vision, natural language processing, and graph representation learning tasks uncovers a number of surprising results that will help researchers further improve deep learning optimizers and help practitioners identify the right optimizer for their problem.

Sharpness-aware Quantization for Deep Neural Networks

Extensive experiments on both convolutional neural networks and Transformers across various datasets show that SAQ improves the generalization performance of quantized models, yielding state-of-the-art results under uniform quantization.

Structured Pruning Learns Compact and Accurate Models

This work proposes a task-specific structured pruning method, CoFi (Coarse- and Fine-grained Pruning), which delivers highly parallelizable subnetworks and matches distillation methods in both accuracy and latency without resorting to any unlabeled data.

An Empirical Investigation of the Role of Pre-training in Lifelong Learning

This work investigates existing lifelong-learning methods in the context of large pre-trained models, evaluates their performance on a variety of text and image classification tasks, and proposes jointly optimizing current-task loss and loss-basin sharpness to explicitly encourage wider basins during sequential fine-tuning.

Sharpness-Aware Minimization for Efficiently Improving Generalization

This work introduces a novel, effective procedure for simultaneously minimizing loss value and loss sharpness, Sharpness-Aware Minimization (SAM), which improves model generalization across a variety of benchmark datasets and models, yielding novel state-of-the-art performance for several.
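The SAM procedure can be sketched in a few lines: an ascent step to the (approximately) worst-case weights within a small neighborhood, then a descent step using the gradient computed there. A minimal numpy sketch follows; `loss_grad` is a hypothetical function returning the loss gradient at given weights, and `rho` is the neighborhood radius.

```python
import numpy as np

def sam_step(w, loss_grad, lr=0.1, rho=0.05):
    """One sharpness-aware update: ascend within a rho-ball, then descend."""
    g = loss_grad(w)                                   # gradient at current weights
    eps = rho * g / (np.linalg.norm(g) + 1e-12)        # ascent direction to sharpest nearby point
    g_sharp = loss_grad(w + eps)                       # gradient at the perturbed weights
    return w - lr * g_sharp                            # descend using the sharpness-aware gradient
```

Because each update needs two gradient evaluations, SAM roughly doubles the per-step cost relative to SGD.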

The Lottery Ticket Hypothesis for Pre-trained BERT Networks

This work finds matching subnetworks at 40% to 90% sparsity in BERT models at (pre-trained) initialization, a deviation from prior NLP research in which matching subnetworks emerged only after some amount of training.

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

A benchmark of nine diverse NLU tasks, an auxiliary diagnostic dataset for probing models' understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models; the benchmark favors models that represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge transfer across tasks.

Model compression

This work presents a method for "compressing" large, complex ensembles into smaller, faster models, usually without significant loss in performance.
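The core idea of that early model-compression work can be sketched as: use the large ensemble to produce soft targets, then fit a single small model to those targets. The toy example below uses hypothetical linear "ensemble members" and a linear "student" so the fit is exact; real compression uses neural students and unlabeled transfer data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                          # transfer inputs
ensemble = [rng.normal(size=5) for _ in range(10)]     # hypothetical linear ensemble members
soft_targets = np.mean([X @ w for w in ensemble], axis=0)  # ensemble's averaged predictions
# "student": a single small model fit to the ensemble's outputs via least squares
student, *_ = np.linalg.lstsq(X, soft_targets, rcond=None)
```

In this linear toy case the student exactly recovers the average of the ensemble weights; the interesting regime is when the student has far less capacity than the ensemble.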

Optimal Brain Damage

A class of practical and nearly optimal schemes is derived for adapting the size of a neural network, using second-derivative information to trade off network complexity against training-set error.
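Concretely, with a diagonal Hessian approximation the Optimal Brain Damage saliency of weight w_i is s_i = h_ii * w_i^2 / 2 (the estimated loss increase from deleting it), and the least-salient weights are pruned first. A minimal numpy sketch, with hypothetical helper names:

```python
import numpy as np

def obd_saliencies(weights, hessian_diag):
    """Estimated loss increase from deleting each weight (diagonal Hessian)."""
    return 0.5 * hessian_diag * weights ** 2

def prune_lowest_saliency(weights, hessian_diag, n_prune):
    """Zero the n_prune weights whose deletion is estimated to hurt least."""
    s = obd_saliencies(weights, hessian_diag)
    idx = np.argsort(s)[:n_prune]      # indices of the least-damaging deletions
    pruned = weights.copy()
    pruned[idx] = 0.0
    return pruned
```

Unlike pure magnitude pruning, a large weight can still be pruned if its Hessian entry is small, i.e. if the loss surface is flat along that coordinate.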

Label Noise SGD Provably Prefers Flat Global Minimizers

This analysis uncovers an additional regularization effect of large learning rates beyond the linear scaling rule that penalizes large eigenvalues of the Hessian more than small ones.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

TinyBERT: Distilling BERT for Natural Language Understanding

A novel Transformer distillation method specially designed for knowledge distillation (KD) of Transformer-based models is proposed; by leveraging this new KD method, the rich knowledge encoded in a large "teacher" BERT can be effectively transferred to a small "student" TinyBERT.