Structural Knowledge Distillation: Tractably Distilling Information for Structured Predictor

@inproceedings{Wang2021StructuralKD,
  title={Structural Knowledge Distillation: Tractably Distilling Information for Structured Predictor},
  author={Xinyu Wang and Yong Jiang and Zhaohui Yan and Zixia Jia and Nguyen Bach and Tao Wang and Zhongqiang Huang and Fei Huang and Kewei Tu},
  booktitle={ACL},
  year={2021}
}
Knowledge distillation is a critical technique to transfer knowledge between models, typically from a large model (the teacher) to a more fine-grained one (the student). The objective function of knowledge distillation is typically the cross-entropy between the teacher and the student’s output distributions. However, for structured prediction problems, the output space is exponential in size; therefore, the cross-entropy objective becomes intractable to compute and optimize directly. In this… 
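The intractability claim can be made concrete. Writing p_T for the teacher's output distribution, p_S for the student's, and \mathcal{Y}(x) for the set of possible output structures for an input x (our notation, used only for illustration), the usual distillation objective is the cross-entropy

  \mathcal{L}_{\mathrm{KD}}(x) = -\sum_{y \in \mathcal{Y}(x)} p_T(y \mid x)\,\log p_S(y \mid x)

For structured prediction, \mathcal{Y}(x) contains every admissible output structure (for example, every label sequence over a sentence), so its size grows exponentially with the input length and the sum cannot be enumerated term by term.

Citations
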
Efficient Sub-structured Knowledge Distillation
TLDR
This work proposes an approach that is much simpler in its formulation and far more efficient to train than existing approaches: it transfers knowledge from a teacher model to its student model by locally matching their predictions on all sub-structures instead of over the whole output space.
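Read loosely, and in our own notation rather than that paper's, matching predictions on all sub-structures replaces the sum over whole output structures with a sum over the set of sub-structures \mathcal{U}(x) (for example, labeled spans or dependency arcs):

  \mathcal{L}_{\mathrm{local}}(x) = -\sum_{u \in \mathcal{U}(x)} p_T(u \mid x)\,\log p_S(u \mid x)

Because |\mathcal{U}(x)| grows only polynomially with the input length and the marginals p(u \mid x) are available from standard inference (or directly from a locally normalized model), this objective is cheap to evaluate; treat the formula as an illustrative sketch, not the paper's exact loss.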
Language Modelling via Learning to Rank
TLDR
It is shown that rank-based KD generally improves perplexity (PPL), often with statistical significance, when compared to Kullback–Leibler-based KD, and that this can be done without the use of a pre-trained LM.
Automated Concatenation of Embeddings for Structured Prediction
TLDR
This paper proposes Automated Concatenation of Embeddings (ACE) to automate the process of finding better concatenations of embeddings for structured prediction tasks, based on a formulation inspired by recent progress on neural architecture search.
Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning
TLDR
This paper finds empirically that the contextual representations computed on the retrieval-based input view, constructed through the concatenation of a sentence and its external contexts, can achieve significantly improved performance compared to the original input view based only on the sentence.

References

Showing 1–10 of 46 references
Knowledge Distillation for Sequence Model
Knowledge distillation, or teacher-student training, has been effectively used to improve the performance of a relatively simpler deep learning model (the student) using a more complex model (the teacher)…
XtremeDistil: Multi-stage Distillation for Massive Multilingual Models
TLDR
This work studies knowledge distillation with a focus on multilingual Named Entity Recognition (NER) and proposes a stage-wise optimization scheme leveraging teacher internal representations, which is agnostic of teacher architecture, and shows that it outperforms strategies employed in prior works.
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
TLDR
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performances on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.
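As a rough sketch of the triple loss mentioned above (the weights \alpha_i are generic placeholders, not values taken from the paper):

  \mathcal{L} = \alpha_1 \mathcal{L}_{\mathrm{lm}} + \alpha_2 \mathcal{L}_{\mathrm{distill}} + \alpha_3 \mathcal{L}_{\mathrm{cos}}

where \mathcal{L}_{\mathrm{lm}} is the language modeling loss, \mathcal{L}_{\mathrm{distill}} the cross-entropy against the teacher's soft output distribution, and \mathcal{L}_{\mathrm{cos}} a cosine-distance loss aligning student and teacher hidden representations.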
Structure-Level Knowledge Distillation For Multilingual Sequence Labeling
TLDR
This paper proposes two novel KD methods based on structure-level information: one approximately minimizes the distance between the student's and the teachers' structure-level probability distributions, and the other aggregates the structure-level knowledge into local distributions and minimizes the distance between the two local probability distributions.
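Very roughly, and in notation of our own choosing, the two objectives can be sketched as a k-best approximation to the structure-level cross-entropy and a token-level (posterior) cross-entropy, respectively:

  \mathcal{L}_{\mathrm{struct}} \approx -\sum_{y \in \mathcal{T}_K} \tilde{p}_T(y \mid x)\,\log p_S(y \mid x)
  \mathcal{L}_{\mathrm{local}} = -\sum_{i} \sum_{l} p_T(y_i = l \mid x)\,\log p_S(y_i = l \mid x)

where \mathcal{T}_K is the teacher's k-best output set with renormalized probabilities \tilde{p}_T, and p_T(y_i = l \mid x), p_S(y_i = l \mid x) are per-token marginal probabilities. This is an illustrative reading of the summary, not the paper's exact formulation.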
Distilling Neural Networks for Greener and Faster Dependency Parsing
TLDR
This work uses teacher-student distillation to improve the efficiency of the Biaffine dependency parser which obtains state-of-the-art performance with respect to accuracy and parsing speed and achieves a parser which is not only faster but also more accurate than the fastest modern parser on the Penn Treebank.
Distilling Task-Specific Knowledge from BERT into Simple Neural Networks
TLDR
This paper proposes to distill knowledge from BERT, a state-of-the-art language representation model, into a single-layer BiLSTM, as well as its siamese counterpart for sentence-pair tasks, and achieves comparable results with ELMo.
Efficient Second-Order TreeCRF for Neural Dependency Parsing
TLDR
This paper presents a second-order TreeCRF extension to the biaffine parser, and proposes an effective way to batchify the inside and Viterbi algorithms for direct large matrix operations on GPUs, and to avoid the complex outside algorithm via efficient back-propagation.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Automated Concatenation of Embeddings for Structured Prediction
TLDR
This paper proposes Automated Concatenation of Embeddings (ACE) to automate the process of finding better concatenations of embeddings for structured prediction tasks, based on a formulation inspired by recent progress on neural architecture search.
Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT
TLDR
This paper explores the broader cross-lingual potential of mBERT (multilingual) as a zero shot language transfer model on 5 NLP tasks covering a total of 39 languages from various language families: NLI, document classification, NER, POS tagging, and dependency parsing.