Corpus ID: 244909258

Causal Distillation for Language Models

@article{Wu2021CausalDF,
  title={Causal Distillation for Language Models},
  author={Zhengxuan Wu and Atticus Geiger and Josh Rozner and Elisa Kreiss and Hanson Lu and Thomas F. Icard and Christopher Potts and Noah D. Goodman},
  journal={ArXiv},
  year={2021},
  volume={abs/2112.02505}
}
Distillation efforts have led to language models that are more compact and efficient without serious drops in performance. The standard approach to distillation trains a student model against two objectives: a task-specific objective (e.g., language modeling) and an imitation objective that encourages the hidden states of the student model to be similar to those of the larger teacher model. In this paper, we show that it is beneficial to augment distillation with a third objective that… 
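To make the setup concrete, here is a minimal PyTorch-style sketch of a student loss combining the two standard objectives described above (a task-specific masked-language-modeling term and a hidden-state imitation term), with a slot left open for the paper's third objective. The function name, weighting coefficients, and choice of matched layer are illustrative assumptions, not the authors' implementation.

# Minimal sketch of the two standard distillation objectives described above,
# plus a placeholder slot for the paper's third (causal) objective.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, labels,
                      causal_term=None, alpha=0.5, beta=0.5, gamma=1.0):
    """student_out / teacher_out expose .logits and .hidden_states (as in
    Hugging Face-style model outputs); labels are MLM targets with -100 at
    unmasked positions."""
    # 1) Task-specific objective, e.g. masked language modeling.
    vocab = student_out.logits.size(-1)
    task = F.cross_entropy(student_out.logits.view(-1, vocab),
                           labels.view(-1), ignore_index=-100)

    # 2) Imitation objective: push the student's hidden states toward the
    #    (frozen) teacher's. Here: a cosine loss on the final layer.
    s_h = student_out.hidden_states[-1]
    t_h = teacher_out.hidden_states[-1].detach()
    ones = s_h.new_ones(s_h.size(0) * s_h.size(1))
    imitate = F.cosine_embedding_loss(s_h.view(-1, s_h.size(-1)),
                                      t_h.view(-1, t_h.size(-1)), ones)

    # 3) Third objective: computed elsewhere (see the interchange-intervention
    #    sketch after the reference list) and passed in as a scalar tensor.
    causal = causal_term if causal_term is not None else task.new_zeros(())

    return alpha * task + beta * imitate + gamma * causal

Which layers are matched is a design choice: DistilBERT-style distillation uses the final hidden states (plus a soft-target term over output distributions), while Patient Knowledge Distillation matches several intermediate layers.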


References

Showing 1-10 of 29 references
Patient Knowledge Distillation for BERT Model Compression
TLDR: This work proposes a Patient Knowledge Distillation approach to compress an original large model (teacher) into an equally effective lightweight shallow network (student), which translates into improved results on multiple NLP tasks with a significant gain in training efficiency, without sacrificing model accuracy.
Inducing Causal Structure for Interpretable Neural Networks
TLDR: The new method of interchange intervention training (IIT) is presented, which aligns variables in the causal model with representations in the neural model and trains the neural model to match the counterfactual behavior of the causal model on a base input when the aligned representations in both models are set to the values they would take for a second source input (a minimal sketch of this operation appears after the reference list).
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
TLDR: This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Model Compression for Domain Adaptation through Causal Effect Estimation
TLDR: This work proposes an average treatment effect (ATE)-guided Model Compression scheme (AMoC), which generates many model candidates, differing by the model components that were removed, and selects the best candidate through a stepwise regression model that utilizes the ATE to predict the expected performance on the target domain.
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
TLDR: A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge transfer across tasks.
Causal Abstractions of Neural Networks
TLDR: It is discovered that a BERT-based model with state-of-the-art performance successfully realizes parts of the natural logic model's causal structure, whereas a simpler baseline model fails to show any such structure, demonstrating that BERT representations encode the compositional structure of MQNLI.
Universal Transformers
TLDR: The Universal Transformer (UT), a parallel-in-time self-attentive recurrent sequence model which can be cast as a generalization of the Transformer model and which addresses issues of parallelizability and global receptive field, is proposed.
Distilling the Knowledge in a Neural Network
TLDR: This work shows that it can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model, and introduces a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse.
A Tensorized Transformer for Language Modeling
TLDR: A novel self-attention model (Multi-linear attention) based on Block-Term Tensor Decomposition (BTD), combined with tensor train decomposition, is proposed; it can not only largely compress the model parameters but also obtain performance improvements.
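The interchange intervention described in the IIT entry above boils down to one operation: compute an aligned internal representation on a source input, splice it into a forward pass on a base input, and train against the counterfactual output the higher-level causal model would produce under that intervention. Below is a minimal, self-contained sketch of that operation on a toy network; the model, dimensions, and targets are placeholders for illustration, not the authors' code.

# Illustrative sketch of an interchange intervention, as summarized in the
# IIT reference above. A toy two-layer network stands in for the neural
# model; its first layer's output is the representation aligned with a
# causal-model variable. All names and sizes are assumptions.
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self, d=16, n_classes=3):
        super().__init__()
        self.layer1 = nn.Linear(d, d)         # aligned with a causal-model variable
        self.layer2 = nn.Linear(d, n_classes)

    def forward(self, x, swap_in=None):
        h = torch.relu(self.layer1(x))
        if swap_in is not None:               # interchange intervention:
            h = swap_in                       # overwrite the aligned representation
        return self.layer2(h)

net = TinyNet()
base, source = torch.randn(8, 16), torch.randn(8, 16)

# 1) Record the aligned representation produced by the source input.
h_source = torch.relu(net.layer1(source))

# 2) Run the base input with the source representation swapped in.
logits = net(base, swap_in=h_source)

# 3) Train toward the counterfactual output prescribed by the high-level
#    causal model under the same intervention (stand-in labels here).
counterfactual_target = torch.randint(0, 3, (8,))
loss = nn.functional.cross_entropy(logits, counterfactual_target)
loss.backward()

In the causal-distillation setting, the counterfactual target would presumably come from the teacher model run under the corresponding intervention rather than from the stand-in labels used here.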