Lifelong Language Knowledge Distillation

@inproceedings{Chuang2020LifelongLK,
  title={Lifelong Language Knowledge Distillation},
  author={Yung-Sung Chuang and Shang-Yu Su and Yun-Nung Chen},
  booktitle={EMNLP},
  year={2020}
}
It is challenging to perform lifelong language learning (LLL) on a stream of different tasks without any performance degradation compared to the multi-task counterparts. To address this issue, we present Lifelong Language Knowledge Distillation (L2KD), a simple but efficient method that can be easily applied to existing LLL architectures in order to mitigate the degradation. Specifically, when the LLL model is trained on a new task, we assign a teacher model to first learn the new task, and…
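
The mechanism sketched in the abstract is knowledge distillation from a task-specific teacher: the teacher learns the new task first, and the lifelong student then trains on the teacher's softened output distributions rather than only hard labels. L2KD considers several distillation objectives; the PyTorch snippet below illustrates only the standard word-level variant, is not the paper's code, and all names are hypothetical.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between softened teacher and student token distributions.
    # Both logits tensors have shape (batch, seq_len, vocab).
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable to hard-label cross-entropy.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t ** 2)

# Hypothetical usage, with a frozen teacher already trained on the new task:
# with torch.no_grad():
#     teacher_logits = teacher(batch_inputs)
# loss = distillation_loss(student(batch_inputs), teacher_logits)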

LFPT5: A Unified Framework for Lifelong Few-shot Language Learning Based on Prompt Tuning of T5

TLDR
This work proposes a framework called LFPT5, which takes full advantage of PT's strong few-shot learning ability and simultaneously trains the model as a task solver and a data generator; it can be applied to various types of tasks and outperforms previous methods in different LFLL settings.

RVAE-LAMOL: Residual Variational Autoencoder to Enhance Lifelong Language Learning

TLDR
The residual variational autoencoder (RVAE) is proposed to enhance LAMOL, a recent LLL model, by mapping different tasks into a limited unified semantic space, together with an identity task that makes the model discriminative in recognizing which task a sample belongs to.

Ask Question First for Enhancing Lifelong Language Learning

TLDR
The Ask Question First and Replay Question (AQF-RQ) method is proposed, including a novel data format "BQCA" and a new training task to train pseudo questions of previous tasks; experimental results demonstrate that AQF-RQ makes it easier for the model to generate more pseudo data that match the corresponding tasks, and that it is more robust to both sufficient and insufficient pseudo data, whether the task boundary is clear or unclear.

Lifelong Pretraining: Continually Adapting Language Models to Emerging Corpora

TLDR
The authors' experiments show distillation-based approaches to be most effective at retaining downstream performance in earlier domains; they also improve knowledge transfer, allowing models to achieve better downstream performance on the latest data, and improve temporal generalization when distribution gaps exist between training and evaluation due to time.

Reminding the Incremental Language Model via Data-Free Self-Distillation

TLDR
The experimental results demonstrate that the proposed incremental language model via data-free self-distillation (DFSD) can exceed previous state-of-the-art methods even when the amount of pseudo data is reduced by as much as 90%.

LFPT5: A Unified Framework for Lifelong Few-shot Language Learning Based on Prompt Tuning of T5

TLDR
This work defines this more challenging yet practical problem as Lifelong Few-shot Language Learning (LFLL) and proposes a unified framework for it based on prompt tuning (PT) of T5, called LFPT5, which takes full advantage of PT’s strong few-shot learning ability, and simultaneously trains the model as a task solver and a data generator.
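
Prompt tuning, which LFPT5 builds on, keeps the pretrained T5 weights frozen and learns only a short sequence of soft prompt embeddings. The sketch below is a generic PyTorch illustration of that idea, not the LFPT5 implementation; the class name, initialization scale, and the constants in the usage comments are assumptions.

import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    # Learnable prompt embeddings prepended to the embedded input sequence.
    def __init__(self, prompt_length: int, embed_dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_length, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim)
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# Hypothetical usage: freeze the backbone and optimize only the prompt parameters.
# for p in frozen_lm.parameters():
#     p.requires_grad_(False)
# soft_prompt = SoftPrompt(prompt_length=100, embed_dim=768)
# optimizer = torch.optim.Adam(soft_prompt.parameters(), lr=0.3)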

Achieving Forgetting Prevention and Knowledge Transfer in Continual Learning

TLDR
A novel model called CTR is proposed to solve the problems of overcoming catastrophic forgetting and encouraging knowledge transfer across tasks in continual learning, and the experimental results demonstrate the effectiveness of CTR.

Continual Sequence Generation with Adaptive Compositional Modules

TLDR
Experimental results show that the proposed framework for continual sequence generation with adaptive compositional modules and pseudo experience replay can adaptively add or reuse modules based on task similarity, outperforming state-of-the-art baselines in terms of both performance and parameter efficiency.

Multi-Strategy Knowledge Distillation Based Teacher-Student Framework for Machine Reading Comprehension

TLDR
A multi-strategy knowledge distillation based teacher-student framework (MSKDTS) for machine reading comprehension that can predict answers similar to the teacher model without being aware of which sentence in the document is the corresponding evidence.

References

LAMOL: LAnguage MOdeling for Lifelong Language Learning

TLDR
The results show that LAMOL prevents catastrophic forgetting without any sign of intransigence and can perform five very different language tasks sequentially with only one model.
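
LAMOL's key idea is that the same language model acts as both a task solver and a generator of pseudo samples for earlier tasks, and those generated samples are mixed into the new task's training data. The skeleton below shows only that replay schedule; the model-specific pieces (one optimization step and the pseudo-sample generator) are passed in as callables, and the mixing ratio is an illustrative choice, not the paper's setting.

import random
from typing import Callable, List, Sequence

def lamol_train_task(
    train_step: Callable[[Sequence[str]], None],   # one optimization step on a mini-batch
    generate_pseudo: Callable[[int], List[str]],   # sample old-task examples from the LM itself
    new_task_data: List[str],
    gen_ratio: float = 0.2,
    batch_size: int = 8,
) -> None:
    # Generate pseudo samples of earlier tasks, then train on the mixture.
    pseudo = generate_pseudo(int(gen_ratio * len(new_task_data)))
    mixed = list(new_task_data) + pseudo
    random.shuffle(mixed)
    for i in range(0, len(mixed), batch_size):
        train_step(mixed[i:i + batch_size])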

BAM! Born-Again Multi-Task Networks for Natural Language Understanding

TLDR
This work proposes using knowledge distillation where single-task models teach a multi-task model, and enhances this training with teacher annealing, a novel method that gradually transitions the model from distillation to supervised learning, helping the multi-task model surpass its single-task teachers.
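
Teacher annealing, as summarized above, interpolates between the teacher's predictions and the gold labels, with the gold weight growing over training so that the student ends on pure supervised learning. Below is a minimal PyTorch sketch of that mixed target for single-label classification; the linear schedule and the names are assumptions, not the paper's code.

import torch.nn.functional as F

def annealed_distillation_loss(student_logits, teacher_logits, gold_labels,
                               step: int, total_steps: int):
    # lam = 0 means pure distillation, lam = 1 means pure supervised learning.
    lam = min(step / total_steps, 1.0)
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    gold_onehot = F.one_hot(gold_labels, num_classes=student_logits.size(-1)).float()
    target = lam * gold_onehot + (1.0 - lam) * teacher_probs
    log_probs = F.log_softmax(student_logits, dim=-1)
    return -(target * log_probs).sum(dim=-1).mean()   # cross-entropy against the mixed target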

Lifelong Learning via Progressive Distillation and Retrospection

TLDR
A novel approach to lifelong learning is proposed, which tries to seek a better balance between preservation and adaptation via two techniques: Distillation and Retrospection.

Episodic Memory in Lifelong Language Learning

TLDR
This work proposes an episodic memory model that performs sparse experience replay and local adaptation to mitigate catastrophic forgetting in a lifelong language learning setup where a model needs to learn from a stream of text examples without any dataset identifier.
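
Sparse experience replay means examples from the stream are written to an episodic memory, and a small batch from that memory is replayed only occasionally rather than at every step. The sketch below shows only this schedule, with the gradient update passed in as a callable; it omits the paper's key-value memory and test-time local adaptation, and all names and rates are illustrative.

import random
from typing import Callable, List, Sequence

class EpisodicMemory:
    def __init__(self) -> None:
        self.buffer: List[object] = []

    def write(self, examples: Sequence[object]) -> None:
        self.buffer.extend(examples)

    def sample(self, k: int) -> List[object]:
        return random.sample(self.buffer, min(k, len(self.buffer)))

def train_on_stream(train_step: Callable[[Sequence[object]], None],
                    stream_batches: Sequence[Sequence[object]],
                    replay_every: int = 100, replay_size: int = 32) -> None:
    memory = EpisodicMemory()
    for step, batch in enumerate(stream_batches, start=1):
        train_step(batch)            # ordinary update on the incoming batch
        memory.write(batch)
        if step % replay_every == 0 and memory.buffer:
            train_step(memory.sample(replay_size))   # sparse replay from memory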

Born Again Neural Networks

TLDR
This work studies KD from a new perspective: rather than compressing models, students are parameterized identically to their teachers and trained to match their outputs; it shows significant advantages from transferring knowledge between DenseNets and ResNets in either direction.
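
The born-again procedure trains a sequence of generations: each new student has the same architecture as its teacher, is distilled from it, and then serves as the teacher for the next generation. A minimal sketch of that outer loop follows, with the architecture constructor and the distillation routine supplied as callables; the structure and names are assumptions, not the paper's code.

from typing import Callable, List

def born_again(make_model: Callable[[], object],
               distill: Callable[[object, object], object],   # (teacher, fresh student) -> trained student
               first_teacher: object,
               generations: int = 3) -> List[object]:
    # Each generation's student becomes the next generation's teacher.
    models = [first_teacher]
    for _ in range(generations):
        student = distill(models[-1], make_model())   # identical architecture, new initialization
        models.append(student)
    return models   # later generations (or their ensemble) are then evaluated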

Lifelong GAN: Continual Learning for Conditional Image Generation

TLDR
A more generic framework for continual learning of generative models under different conditional image generation settings is proposed, and Lifelong GAN employs knowledge distillation to transfer learned knowledge from previous networks to the new network, making it possible to perform image-conditioned generation tasks in a lifelong learning setting.

Lifelong Domain Word Embedding via Meta-Learning

TLDR
A novel lifelong learning setting for domain word embedding is proposed, in which a meta-learner characterizes the similarities of the contexts of the same word across many domain corpora, helping retrieve relevant data from past domains to expand the new domain corpus.

Learning and Evaluating General Linguistic Intelligence

TLDR
This work analyzes state-of-the-art natural language understanding models and conducts an extensive empirical investigation to evaluate them against general linguistic intelligence criteria, and proposes a new evaluation metric based on an online encoding of the test data that quantifies how quickly an existing agent (model) learns a new task.

Get To The Point: Summarization with Pointer-Generator Networks

TLDR
A novel architecture that augments the standard sequence-to-sequence attentional model in two orthogonal ways, using a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator.
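
The hybrid pointer-generator combines the decoder's vocabulary distribution with a copy distribution obtained by scattering attention weights onto the source token ids, weighted by a generation probability p_gen. The PyTorch sketch below illustrates that mixture; tensor shapes and names here are assumptions rather than the paper's implementation.

import torch

def final_distribution(p_gen, vocab_dist, attn_weights, src_ids, extended_vocab_size):
    # p_gen:        (batch, 1) probability of generating from the fixed vocabulary
    # vocab_dist:   (batch, vocab) softmax over the fixed vocabulary
    # attn_weights: (batch, src_len) attention over source positions
    # src_ids:      (batch, src_len) source token ids in the extended (copy) vocabulary
    batch, vocab = vocab_dist.size()
    extended = torch.zeros(batch, extended_vocab_size, device=vocab_dist.device)
    extended[:, :vocab] = p_gen * vocab_dist
    # Copy term: route (1 - p_gen) * attention mass to each source token's id.
    extended.scatter_add_(1, src_ids, (1.0 - p_gen) * attn_weights)
    return extended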

Non-Autoregressive Neural Machine Translation

TLDR
A model is introduced that avoids this autoregressive property and produces its outputs in parallel, allowing an order of magnitude lower latency during inference, and achieves near-state-of-the-art performance on WMT 2016 English-Romanian.