Beyond Preserved Accuracy: Evaluating Loyalty and Robustness of BERT Compression

@article{Xu2021BeyondPA,
  title={Beyond Preserved Accuracy: Evaluating Loyalty and Robustness of BERT Compression},
  author={Canwen Xu and Wangchunshu Zhou and Tao Ge and Ke Xu and Julian McAuley and Furu Wei},
  journal={ArXiv},
  year={2021},
  volume={abs/2109.03228}
}
Recent studies on compression of pretrained language models (e.g., BERT) usually use preserved accuracy as the metric for evaluation. In this paper, we propose two new metrics, label loyalty and probability loyalty, that measure how closely a compressed model (i.e., student) mimics the original model (i.e., teacher). We also explore the effect of compression with regard to robustness under adversarial attacks. We benchmark quantization, pruning, knowledge distillation and progressive module… 
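
As a concrete illustration, the sketch below computes the two metrics from teacher and student classifier outputs. It is a minimal reading of the definitions, assuming label loyalty is the student's agreement rate with the teacher's predicted labels and probability loyalty is one minus the Jensen-Shannon distance between the two output distributions; the paper's exact formulation may differ in detail.

```python
import numpy as np
from scipy.special import softmax
from scipy.spatial.distance import jensenshannon

def label_loyalty(teacher_logits, student_logits):
    """Fraction of examples on which the student predicts the same label as the
    teacher, i.e., accuracy with the teacher's predictions as ground truth."""
    t_pred = np.asarray(teacher_logits).argmax(axis=-1)
    s_pred = np.asarray(student_logits).argmax(axis=-1)
    return float((t_pred == s_pred).mean())

def probability_loyalty(teacher_logits, student_logits):
    """One minus the mean Jensen-Shannon distance between teacher and student
    output distributions; 1.0 means the student reproduces the teacher exactly."""
    t_probs = softmax(np.asarray(teacher_logits), axis=-1)
    s_probs = softmax(np.asarray(student_logits), axis=-1)
    dists = [jensenshannon(t, s, base=2) for t, s in zip(t_probs, s_probs)]
    return float(1.0 - np.mean(dists))
```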

Citations

What do Compressed Large Language Models Forget? Robustness Challenges in Model Compression

A study of two popular model compression techniques, knowledge distillation and pruning, shows that compressed models are significantly less robust than their PLM counterparts on adversarial test sets, although they obtain similar performance on in-distribution development sets for a task.

Robust Lottery Tickets for Pre-trained Language Models

This work proposes a novel method based on learning binary weight masks to identify robust tickets hidden in the original PLMs, and designs an adversarial loss objective to guide the search for robust tickets and ensure that the tickets perform well in both accuracy and robustness.

Train Flat, Then Compress: Sharpness-Aware Minimization Learns More Compressible Models

Optimizing for flat minima consistently leads to greater compressibility of parameters compared to standard Adam optimization when fine-tuning BERT models, enabling higher rates of compression with little to no loss in accuracy on the GLUE classification benchmark.
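
For readers unfamiliar with sharpness-aware minimization, the sketch below shows one SAM update step: perturb the weights toward the local worst case, then update using gradients from the perturbed point. It is an illustrative, generic PyTorch implementation rather than the cited paper's training setup; `model`, `loss_fn` and `base_optimizer` are placeholders.

```python
import torch

def sam_step(model, loss_fn, inputs, labels, base_optimizer, rho=0.05):
    """One Sharpness-Aware Minimization step: ascend to an approximate worst-case
    point within an L2 ball of radius rho, then update the original weights with
    the gradient taken at that point."""
    # 1) gradient at the current weights
    loss_fn(model(inputs), labels).backward()
    grad_norm = torch.norm(torch.stack(
        [p.grad.norm() for p in model.parameters() if p.grad is not None]))
    perturbations = {}
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)                 # climb toward the local worst case
            perturbations[p] = e
    model.zero_grad()
    # 2) gradient at the perturbed weights drives the actual update
    loss_fn(model(inputs), labels).backward()
    with torch.no_grad():
        for p, e in perturbations.items():
            p.sub_(e)                 # restore the original weights
    base_optimizer.step()
    model.zero_grad()
```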

Intriguing Properties of Compression on Multilingual Models

This work proposes an experimental framework to characterize the impact of sparsifying multilingual pre-trained language models during fine-tuning, and observes that under certain sparsification regimes compression may aid, rather than disproportionately hurt, the performance of low-resource languages.

A Win-win Deal: Towards Sparse and Robust Pre-trained Language Models

This paper conducts large-scale experiments with the pre-trained BERT model on three natural language understanding (NLU) tasks, showing that BERT does contain sparse and robust subnetworks (SRNets) within a certain sparsity constraint, and explores the upper bound of SRNets by making use of the OOD information, revealing that there exist sparse and almost unbiased subnetworks.

Compressing And Debiasing Vision-Language Pre-Trained Models for Visual Question Answering

This paper investigates whether a VLP can be compressed and debiased simultaneously by searching for sparse and robust subnetworks, and shows that there indeed exist sparse and robust LXMERT subnetworks, which significantly outperform the full model (without debiasing) with far fewer parameters.

Pruning has a disparate impact on model accuracy

Light is shed on the factors responsible for this critical issue, identifying disparities in gradient norms and distance to the decision boundary across groups as the cause, and a simple solution is proposed that mitigates the disparate impacts caused by pruning.

Feature Structure Distillation for BERT Transferring

This work proposes feature structure distillation methods based on Centered Kernel Alignment, which assigns a consistent value to similar feature structures and reveals more informative relations, and implements a memory-augmented transfer method with clustering for the global structures.
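
Centered Kernel Alignment itself is easy to state. Below is a small sketch of linear CKA between teacher and student feature matrices and a corresponding distillation loss term; it is a generic illustration of the similarity index, not the cited paper's full method (which also handles local and memory-augmented global structures).

```python
import torch

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between feature matrices of shape
    (n_examples, d_x) and (n_examples, d_y); returns a value in [0, 1]."""
    X = X - X.mean(dim=0, keepdim=True)        # center each feature dimension
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (Y.T @ X).norm(p="fro") ** 2        # ||Y^T X||_F^2
    return hsic / ((X.T @ X).norm(p="fro") * (Y.T @ Y).norm(p="fro"))

def cka_distillation_loss(teacher_feats, student_feats):
    """Encourage the student's feature structure to match the teacher's by
    maximizing CKA, i.e., minimizing 1 - CKA."""
    return 1.0 - linear_cka(teacher_feats, student_feats)
```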

Robust Distillation for Worst-class Performance

Theoretically, this work provides insights into what makes a good teacher when the goal is to train a robust student and empirically shows that robust distillation techniques not only achieve better worst-class performance, but also lead to Pareto improvement in the tradeoff between overall performance and worst-class performance compared to other baseline methods.

References

SHOWING 1-10 OF 46 REFERENCES

Characterising Bias in Compressed Models

This work establishes that compression amplifies existing algorithmic bias on Compression Identified Exemplars (CIE), and proposes their use as a human-in-the-loop auditing tool to surface a tractable subset of the dataset for further inspection or annotation by a domain expert.

Is Robustness the Cost of Accuracy? - A Comprehensive Study on the Robustness of 18 Deep Image Classification Models

This paper thoroughly benchmarks 18 ImageNet models using multiple robustness metrics, including the distortion, success rate and transferability of adversarial examples between 306 pairs of models, and reveals several new insights.

Is BERT Really Robust? Natural Language Attack on Text Classification and Entailment

TextFooler, a general attack framework for generating natural adversarial texts, is presented; it outperforms state-of-the-art attacks in terms of success rate and perturbation rate.
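
Attacks of this kind underlie the robustness evaluations discussed above. The sketch below runs TextFooler against a fine-tuned BERT classifier, assuming the TextAttack library's recipe API and a publicly available SST-2 checkpoint; names and arguments are illustrative.

```python
import textattack
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# A BERT classifier fine-tuned on SST-2 (any sequence-classification checkpoint works).
model = AutoModelForSequenceClassification.from_pretrained("textattack/bert-base-uncased-SST-2")
tokenizer = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-SST-2")
wrapper = textattack.models.wrappers.HuggingFaceModelWrapper(model, tokenizer)

# Build the TextFooler recipe and attack 100 validation examples.
attack = textattack.attack_recipes.TextFoolerJin2019.build(wrapper)
dataset = textattack.datasets.HuggingFaceDataset("glue", "sst2", split="validation")
attacker = textattack.Attacker(attack, dataset, textattack.AttackArgs(num_examples=100))
attacker.attack_dataset()   # prints per-example results and a summary (attack success rate, etc.)
```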

BERT-of-Theseus: Compressing BERT by Progressive Module Replacing

This paper proposes a novel model compression approach that effectively compresses BERT by progressive module replacing; it outperforms existing knowledge distillation approaches on the GLUE benchmark, offering a new perspective on model compression.
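
The core mechanism is simple to sketch: during fine-tuning, each block of original ("predecessor") layers is stochastically swapped for its smaller ("successor") replacement, and at inference only the successors remain. The toy module below illustrates the idea under those assumptions; the actual method also schedules the replacement probability over training.

```python
import torch
import torch.nn as nn

class TheseusBlock(nn.Module):
    """Toy progressive-module-replacing block: the predecessor is, e.g., a pair of
    original BERT layers (kept frozen) and the successor a single smaller layer."""
    def __init__(self, predecessor: nn.Module, successor: nn.Module, replace_prob: float = 0.5):
        super().__init__()
        self.predecessor, self.successor = predecessor, successor
        self.replace_prob = replace_prob
        for p in self.predecessor.parameters():
            p.requires_grad = False            # only the successor is trained

    def forward(self, x):
        if self.training:
            # Bernoulli draw: route through the successor with probability replace_prob.
            use_successor = torch.rand(()) < self.replace_prob
            return self.successor(x) if use_successor else self.predecessor(x)
        return self.successor(x)               # at inference, only the compressed model runs
```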

Adversarial Robustness through Regularization: A Second-Order Approach

The proposed second-order adversarial regularizer (SOAR) is an upper bound based on the Taylor approximation of the inner maximization in the robust optimization objective, and it improves the robustness of networks on the CIFAR-10 dataset.

Early Exiting BERT for Efficient Document Ranking

Early exiting BERT is introduced for document ranking: with a slight modification, BERT becomes a model with multiple output paths from which each inference sample can exit early, so computation can be effectively allocated among samples.
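
The sketch below shows the multi-exit pattern in generic PyTorch terms: each encoder layer gets its own classification head, and at inference a sample leaves at the first head whose confidence clears a threshold. It illustrates the pattern rather than the cited system; the layer modules and the pooling choice are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiExitEncoder(nn.Module):
    """Encoder with one classifier head per layer; confident samples exit early."""
    def __init__(self, layers, hidden_size, num_labels, threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.heads = nn.ModuleList(nn.Linear(hidden_size, num_labels) for _ in layers)
        self.threshold = threshold

    def forward(self, h):                      # h: (batch=1, seq_len, hidden_size) at inference
        exit_probs = []
        for layer, head in zip(self.layers, self.heads):
            h = layer(h)
            probs = F.softmax(head(h.mean(dim=1)), dim=-1)   # mean-pool tokens, classify
            if not self.training and probs.max().item() >= self.threshold:
                return probs                   # confident enough: skip the remaining layers
            exit_probs.append(probs)
        # During training every head is supervised, so return all exit predictions.
        return exit_probs if self.training else probs
```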

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.
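
A triple loss of that kind can be sketched as follows. This is an illustrative combination of a soft-target distillation term, the original task loss and a cosine term aligning hidden states; the loss weights and the hidden-state alignment are placeholders rather than DistilBERT's exact pre-training recipe.

```python
import torch
import torch.nn.functional as F

def triple_distillation_loss(student_logits, teacher_logits, labels,
                             student_hidden, teacher_hidden,
                             temperature=2.0, w_distill=5.0, w_task=2.0, w_cos=1.0):
    """Weighted sum of (1) KL distillation on temperature-softened logits,
    (2) the hard-label task loss, and (3) a cosine loss on hidden states."""
    distill = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                       F.softmax(teacher_logits / temperature, dim=-1),
                       reduction="batchmean") * temperature ** 2
    task = F.cross_entropy(student_logits, labels)
    cos = 1.0 - F.cosine_similarity(student_hidden, teacher_hidden, dim=-1).mean()
    return w_distill * distill + w_task * task + w_cos * cos
```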

MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

This work presents a simple and effective approach to compress large Transformer (Vaswani et al., 2017) based pre-trained models, termed deep self-attention distillation, and demonstrates that the monolingual model outperforms state-of-the-art baselines across different parameter sizes of student models.

RoBERTa: A Robustly Optimized BERT Pretraining Approach

It is found that BERT was significantly undertrained and, when pretrained more carefully, can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.

Detecting Overfitting via Adversarial Examples

A new hypothesis test is proposed that uses only the original test data to detect overfitting, and utilizes a new unbiased error estimate that is based on adversarial examples generated from the test data and importance weighting.