Lifelong Language Knowledge Distillation

@inproceedings{Chuang2020LifelongLK,
  title={Lifelong Language Knowledge Distillation},
  author={Yung-Sung Chuang and Shang-Yu Su and Yun-Nung Chen},
  booktitle={EMNLP},
  year={2020}
}
It is challenging to perform lifelong language learning (LLL) on a stream of different tasks without any performance degradation compared to the multi-task counterparts. To address this issue, we present Lifelong Language Knowledge Distillation (L2KD), a simple but efficient method that can be easily applied to existing LLL architectures in order to mitigate the degradation. Specifically, when the LLL model is trained on a new task, we assign a teacher model to first learn the new task, and… 
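As a rough illustration of the word-level knowledge distillation objective this kind of teacher-student setup relies on, the sketch below blends a softened teacher distribution with the usual cross-entropy on gold labels. It is a minimal sketch assuming a PyTorch-style interface; the temperature T and mixing weight lam are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of soft-target knowledge distillation for a lifelong learner.
# The teacher is trained on the new task first; the student (the LLL model)
# then fits the teacher's softened output distribution in addition to the
# hard labels. T and lam are illustrative assumptions, not the paper's values.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, lam=0.5):
    """Combine soft-target KD with ordinary cross-entropy on gold labels."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_probs = F.log_softmax(student_logits / T, dim=-1)
    # KL term scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    kd = F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, targets)
    return lam * kd + (1.0 - lam) * ce

# Toy usage: batch of 4 examples over a 10-way output.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
targets = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, targets)
loss.backward()
```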
LFPT5: A Unified Framework for Lifelong Few-shot Language Learning Based on Prompt Tuning of T5
TLDR
This work proposes a framework called LFPT5, which takes full advantage of PT’s strong few-shot learning ability, and simultaneously trains the model as a task solver and a data generator, and can be applied to various different types of tasks and outperform previous methods in different LFLL settings.
RVAE-LAMOL: Residual Variational Autoencoder to Enhance Lifelong Language Learning
TLDR
The residual variational autoencoder (RVAE) is proposed to enhance LAMOL, a recent LLL model, by mapping different tasks into a limited unified semantic space, together with an identity task that lets the model recognize which task a sample belongs to.
Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora
TLDR
The authors' experiments show distillation-based approaches to be most effective at retaining downstream performance in earlier domains; they also improve knowledge transfer, allowing models to achieve better downstream performance on the latest data, and improve temporal generalization when distribution gaps exist between training and evaluation due to time.
Reminding the Incremental Language Model via Data-Free Self-Distillation
TLDR
The experimental results demonstrate that the proposed incremental language model via data-free self-distillation (DFSD) can exceed previous state-of-the-art methods even when the amount of pseudo-data is reduced by up to 90%.
LFPT5: A Unified Framework for Lifelong Few-shot Language Learning Based on Prompt Tuning of T5
TLDR
This work defines this more challenging yet practical problem as Lifelong Few-shot Language Learning (LFLL) and proposes a unified framework for it based on prompt tuning (PT) of T5, called LFPT5, which takes full advantage of PT’s strong few-shot learning ability, and simultaneously trains the model as a task solver and a data generator.
Achieving Forgetting Prevention and Knowledge Transfer in Continual Learning
TLDR
A novel model called CTR is proposed to overcome catastrophic forgetting and encourage knowledge transfer across tasks in continual learning, and the experimental results demonstrate the effectiveness of CTR.
Continual Sequence Generation with Adaptive Compositional Modules
TLDR
Experimental results show that the proposed approach, which combines adaptive compositional modules with pseudo experience replay, can adaptively add or reuse modules based on task similarity, outperforming state-of-the-art baselines in terms of both performance and parameter efficiency.
Multi-Strategy Knowledge Distillation Based Teacher-Student Framework for Machine Reading Comprehension
TLDR
A multi-strategy Knowledge Distillation based Teacher-Student framework (MSKDTS) for machine reading comprehension that can predict answers similar to the teacher model without being aware of which sentence in the document is the corresponding evidence.
Modifying Memories in Transformer Models
TLDR
This paper proposes a new task of explicitly modifying specific factual knowledge in Transformer models while ensuring the model performance does not degrade on the unmodified facts, and benchmarked several approaches that provide natural baseline performances on this task.
...

References

SHOWING 1-10 OF 32 REFERENCES
LAMOL: LAnguage MOdeling for Lifelong Language Learning
TLDR
The results show that LAMOL prevents catastrophic forgetting without any sign of intransigence and can perform five very different language tasks sequentially with only one model.
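LAMOL's central idea is that a single language model both solves tasks and generates pseudo-samples of earlier tasks, which are replayed alongside each new task's data. The loop below is a minimal sketch of that generative-replay pattern; generate_pseudo_samples, train_step, and the replay ratio are hypothetical placeholders, not LAMOL's actual interface or settings.

```python
# Sketch of generative replay in the LAMOL style: before training on each new
# task, the single language model generates pseudo-samples that stand in for
# earlier tasks, and the new task's real data is mixed with them.
# `generate_pseudo_samples` and `train_step` are hypothetical placeholders.
import random

def train_lifelong(model, task_streams, replay_ratio=0.2):
    seen_any_task = False
    for task_data in task_streams:          # tasks arrive one at a time
        pseudo = []
        if seen_any_task:
            # Ask the current model to produce samples resembling earlier tasks.
            n_pseudo = int(replay_ratio * len(task_data))
            pseudo = model.generate_pseudo_samples(n_pseudo)
        mixed = list(task_data) + pseudo
        random.shuffle(mixed)
        for example in mixed:
            model.train_step(example)        # ordinary LM-style training step
        seen_any_task = True
    return model
```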
BAM! Born-Again Multi-Task Networks for Natural Language Understanding
TLDR
This work proposes using knowledge distillation where single-task models teach a multi-task model, and enhances this training with teacher annealing, a novel method that gradually transitions the model from distillation to supervised learning, helping the multi-task model surpass its single-task teachers.
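Teacher annealing can be pictured as a training target that interpolates between the teacher's predictions and the gold labels, with the gold side weighted more heavily as training proceeds. The sketch below assumes a PyTorch-style setup and a linear schedule; the schedule shape and weights are illustrative assumptions, not necessarily the paper's exact choices.

```python
# Sketch of teacher annealing: early in training the multi-task student mostly
# imitates its single-task teacher, and the target gradually shifts toward the
# gold labels. The linear schedule is an illustrative assumption.
import torch
import torch.nn.functional as F

def annealed_target(gold_one_hot, teacher_probs, step, total_steps):
    lam = min(step / total_steps, 1.0)            # 0 -> 1 over training
    return lam * gold_one_hot + (1.0 - lam) * teacher_probs

def annealed_loss(student_logits, gold_labels, teacher_probs, step, total_steps):
    num_classes = student_logits.size(-1)
    gold = F.one_hot(gold_labels, num_classes).float()
    target = annealed_target(gold, teacher_probs, step, total_steps)
    log_probs = F.log_softmax(student_logits, dim=-1)
    return -(target * log_probs).sum(dim=-1).mean()   # cross-entropy vs. soft target

# Toy usage at the midpoint of training.
logits = torch.randn(8, 5, requires_grad=True)
labels = torch.randint(0, 5, (8,))
teacher = F.softmax(torch.randn(8, 5), dim=-1)
loss = annealed_loss(logits, labels, teacher, step=500, total_steps=1000)
loss.backward()
```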
Lifelong Learning via Progressive Distillation and Retrospection
TLDR
A novel approach to lifelong learning is proposed that seeks a better balance between preservation and adaptation via two techniques: Distillation and Retrospection.
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Episodic Memory in Lifelong Language Learning
TLDR
This work proposes an episodic memory model that performs sparse experience replay and local adaptation to mitigate catastrophic forgetting in a lifelong language learning setup where a model needs to learn from a stream of text examples without any dataset identifier.
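The sparse experience replay component can be sketched as an episodic memory that stores a small fraction of incoming examples and is replayed only every so many steps. The write probability, replay interval, and train_step interface below are illustrative assumptions, not the paper's exact settings.

```python
# Sketch of sparse experience replay from an episodic memory: a small fraction
# of incoming examples is written to memory, and every `replay_every` steps a
# batch of stored examples is replayed. Rates and `train_step` are assumptions.
import random

class EpisodicMemory:
    def __init__(self, write_prob=0.01):
        self.buffer = []
        self.write_prob = write_prob

    def maybe_write(self, example):
        if random.random() < self.write_prob:
            self.buffer.append(example)

    def sample(self, k):
        return random.sample(self.buffer, min(k, len(self.buffer)))

def train_with_sparse_replay(model, stream, memory, replay_every=100, replay_k=32):
    for step, example in enumerate(stream, start=1):
        model.train_step([example])
        memory.maybe_write(example)
        if step % replay_every == 0 and memory.buffer:
            model.train_step(memory.sample(replay_k))   # sparse replay pass
    return model
```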
Born Again Neural Networks
TLDR
This work studies KD from a new perspective: rather than compressing models, students are trained parameterized identically to their teachers, and shows significant advantages from transferring knowledge between DenseNets and ResNets in either direction.
Understanding Knowledge Distillation in Non-autoregressive Machine Translation
TLDR
It is found that knowledge distillation can reduce the complexity of data sets and help NAT to model the variations in the output data, and a strong correlation is observed between the capacity of an NAT model and the optimal complexity of the distilled data for the best translation quality.
Lifelong GAN: Continual Learning for Conditional Image Generation
TLDR
A more generic framework for continual learning of generative models under different conditional image generation settings is proposed, and Lifelong GAN employs knowledge distillation to transfer learned knowledge from previous networks to the new network, making it possible to perform image-conditioned generation tasks in a lifelong learning setting.
Learning and Evaluating General Linguistic Intelligence
TLDR
This work analyzes state-of-the-art natural language understanding models and conducts an extensive empirical investigation to evaluate them against general linguistic intelligence criteria, and proposes a new evaluation metric based on an online encoding of the test data that quantifies how quickly an existing agent (model) learns a new task.
Get To The Point: Summarization with Pointer-Generator Networks
TLDR
A novel architecture that augments the standard sequence-to-sequence attentional model in two orthogonal ways, using a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator.
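The pointer-generator's final word distribution blends the decoder's vocabulary distribution with a copy distribution formed by scattering attention weights onto the source tokens' vocabulary ids, weighted by a generation probability p_gen. The sketch below assumes PyTorch tensors; variable names and shapes are illustrative.

```python
# Sketch of the pointer-generator mixture: the final distribution is a
# p_gen-weighted blend of the vocabulary distribution and a copy distribution
# obtained by scattering attention mass onto source-token vocabulary ids.
import torch

def final_distribution(vocab_dist, attn_weights, src_ids, p_gen):
    """
    vocab_dist:   (batch, vocab_size)  softmax over the output vocabulary
    attn_weights: (batch, src_len)     attention over source positions
    src_ids:      (batch, src_len)     vocabulary id of each source token
    p_gen:        (batch, 1)           generation probability in [0, 1]
    """
    copy_dist = torch.zeros_like(vocab_dist)
    # Accumulate attention mass onto the vocabulary entries of source tokens.
    copy_dist.scatter_add_(1, src_ids, attn_weights)
    return p_gen * vocab_dist + (1.0 - p_gen) * copy_dist

# Toy usage: batch of 2, vocabulary of 20, source length 5.
vocab_dist = torch.softmax(torch.randn(2, 20), dim=-1)
attn = torch.softmax(torch.randn(2, 5), dim=-1)
src_ids = torch.randint(0, 20, (2, 5))
p_gen = torch.sigmoid(torch.randn(2, 1))
p_final = final_distribution(vocab_dist, attn, src_ids, p_gen)
```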
...