Corpus ID: 211258996

Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation

@article{Xu2020ImprovingBF,
  title={Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation},
  author={Yige Xu and Xipeng Qiu and Li-Gao Zhou and Xuanjing Huang},
  journal={ArXiv},
  year={2020},
  volume={abs/2002.10345}
}
Fine-tuning pre-trained language models like BERT has become an effective approach in NLP and yields state-of-the-art results on many downstream tasks. Recent studies on adapting BERT to new tasks mainly focus on modifying the model structure, re-designing the pre-training tasks, and leveraging external data and knowledge. The fine-tuning strategy itself has yet to be fully explored. In this paper, we improve the fine-tuning of BERT with two effective mechanisms: self-ensemble and self-distillation…
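
The two mechanisms named in the abstract are, roughly, a teacher built from an average of the student's own past weights (self-ensemble) and an extra loss term pulling the student's predictions toward that teacher (self-distillation). Below is a minimal PyTorch-style sketch of that idea, assuming a model that maps inputs directly to logits, a cumulative running average for the teacher, an MSE distillation term, and a loss weight `lam`; these are illustrative assumptions, not the paper's exact formulation (the paper may, for example, average only a window of recent time steps or distill the averaged logits instead).

```python
import copy
import torch
import torch.nn.functional as F

def fine_tune_with_self_distillation(model, loader, optimizer,
                                     num_epochs=3, lam=1.0):
    """Sketch of self-ensemble + self-distillation fine-tuning.

    The teacher holds a running average of the student's parameters
    (self-ensemble); the student is trained on the task loss plus a
    distillation term that pulls its logits toward the teacher's
    (self-distillation). Hyperparameters are illustrative only.
    """
    teacher = copy.deepcopy(model)  # parameter-averaged "self-ensemble" teacher
    for p in teacher.parameters():
        p.requires_grad_(False)

    step = 0
    for _ in range(num_epochs):
        for inputs, labels in loader:
            step += 1
            logits = model(inputs)                 # assumed: model returns class logits
            with torch.no_grad():
                teacher_logits = teacher(inputs)

            task_loss = F.cross_entropy(logits, labels)
            distill_loss = F.mse_loss(logits, teacher_logits)
            loss = task_loss + lam * distill_loss

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Update the teacher as a running average of past student weights.
            with torch.no_grad():
                for pt, ps in zip(teacher.parameters(), model.parameters()):
                    pt.mul_((step - 1) / step).add_(ps, alpha=1.0 / step)
    return model
```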
Citations

Self-boosting for Feature Distillation
Knowledge distillation is a simple but effective method for model compression, which obtains a better-performing small network (Student) by learning from a well-trained large network (Teacher).
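
As background on the teacher/student setup described in this entry, a common knowledge-distillation objective mixes the usual hard-label loss with a KL term between temperature-softened student and teacher distributions. The sketch below shows that generic (Hinton-style) formulation only; the temperature `T` and mixing weight `alpha` are illustrative, and this is not the feature-distillation variant the cited paper proposes.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Generic knowledge-distillation objective (illustrative sketch)."""
    # Standard cross-entropy on the hard labels.
    hard = F.cross_entropy(student_logits, labels)
    # KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients stay comparable across temperatures
    return alpha * hard + (1.0 - alpha) * soft
```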
Bidirectional Language Modeling: A Systematic Literature Review
In transfer learning, two major activities, i.e., pretraining and fine-tuning, are carried out to perform downstream tasks. The advent of the transformer architecture and bidirectional language models…
Data Augmentation and Ensembling for FriendsQA
FriendsQA is a challenging QA dataset consisting of 10,610 questions, based on multi-person dialogues from the TV series Friends. We augmented its training data using back-translation, and proposed a…
Lijunyi at SemEval-2020 Task 4: An ALBERT Model Based Maximum Ensemble with Different Training Sizes and Depths for Commonsense Validation and Explanation
  • Junyi Li, Bin Wang, Haiyan Ding
  • Computer Science
  • SEMEVAL
  • 2020
TLDR
This article describes the system submitted to SemEval-2020 Task 4: Commonsense Validation and Explanation, which mainly used an ALBERT-based maximum ensemble with different training sizes and depths, and demonstrated the model's validity for the task.
Local-Global Knowledge Distillation in Heterogeneous Federated Learning with Non-IID Data
  • Dezhong Yao, Wanning Pan, +5 authors Lichao Sun
  • Computer Science
  • 2021
Federated learning enables multiple clients to collaboratively learn a global model by periodically aggregating the clients’ models without transferring the local data. However, due to the…
Self-Distillation for Few-Shot Image Captioning
  • Xianyu Chen, Ming Jiang, Qi Zhao
  • Computer Science
  • 2021 IEEE Winter Conference on Applications of Computer Vision (WACV)
  • 2021
TLDR
An ensemble-based self-distillation method that allows image captioning models to be trained with unpaired images and captions is proposed, along with a simple yet effective pseudo-feature generation method based on gradient descent.
AT-BERT: Adversarial Training BERT for Acronym Identification Winning Solution for SDU@AAAI-21
TLDR
This paper presents an Adversarial Training BERT method named AT-BERT, the winning solution to the acronym identification task of the Scientific Document Understanding (SDU) Challenge at AAAI 2021. It incorporates the FGM adversarial training strategy into the fine-tuning of BERT, making the model more robust and better at generalizing.
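
The FGM strategy mentioned here perturbs the word-embedding parameters along their normalized gradient direction, runs a second forward/backward pass on the perturbed model, and then restores the weights before the optimizer step. The class below is a sketch of a common FGM-style implementation, not AT-BERT's released code; the `emb_name` matching rule and `epsilon` value are assumptions for illustration.

```python
import torch

class FGM:
    """Fast Gradient Method on the embedding matrix (illustrative sketch)."""

    def __init__(self, model, epsilon=1.0, emb_name="embedding"):
        self.model, self.epsilon, self.emb_name = model, epsilon, emb_name
        self.backup = {}

    def attack(self):
        # Perturb embedding parameters along their gradient direction.
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name and param.grad is not None:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0:
                    param.data.add_(self.epsilon * param.grad / norm)

    def restore(self):
        # Put the original embedding weights back after the adversarial pass.
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}

# Typical call order inside a training step (sketch):
#   loss.backward(); fgm.attack(); loss_adv.backward(); fgm.restore(); optimizer.step()
```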
Multi-semantic granularity graphical model with ensemble BERTs and multi-staged training method for text classification
  • Wenfeng Shen, Qingwei Zeng, Wenjun Gu, Yang Xu
  • 2021
Text classification is a classic problem in Natural Language Processing (NLP). The task is to assign predefined categories to a given text sequence. In this paper we present a new Multi-semantic…
TransQuest at WMT2020: Sentence-Level Direct Assessment
TLDR
A simple QE framework based on cross-lingual transformers is introduced and used to implement and evaluate two different neural architectures, achieving state-of-the-art results that surpass those obtained by OpenKiwi, the baseline used in the shared task.
Lee@HASOC2020: ALBERT-based Max Ensemble with Self-training for Identifying Hate Speech and Offensive Content in Indo-European Languages
TLDR
This paper proposes an ALBERT-based model and uses self-training and a max ensemble to improve model performance, achieving a macro F1 score of 0.4976 in subtask A.

References

SHOWING 1-10 OF 27 REFERENCES
Parameter-Efficient Transfer Learning for NLP
TLDR
To demonstrate the adapters' effectiveness, the recently proposed BERT Transformer model is transferred to 26 diverse text classification tasks, including the GLUE benchmark, and adapters attain near state-of-the-art performance while adding only a few parameters per task.
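
A bottleneck adapter in the spirit of this reference is a small down-project/nonlinearity/up-project block with a residual connection, inserted into each Transformer layer while the pre-trained weights stay frozen and only the adapters (plus layer norms and the task head) are trained. A minimal sketch follows; the hidden and bottleneck sizes are illustrative.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter module (illustrative sketch)."""

    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)  # down-projection
        self.up = nn.Linear(bottleneck, hidden_size)    # up-projection
        self.act = nn.GELU()

    def forward(self, x):
        # Residual connection keeps the adapter close to identity at init.
        return x + self.up(self.act(self.down(x)))
```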
BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning
TLDR
Using new adaptation modules, PALs or 'projected attention layers', this work matches the performance of separately fine-tuned models on the GLUE benchmark with roughly 7 times fewer parameters, and obtains state-of-the-art results on the Recognizing Textual Entailment dataset.
To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks
TLDR
The empirical results across diverse NLP tasks with two state-of-the-art models show that the relative performance of fine-tuning vs. feature extraction depends on the similarity of the pretraining and target tasks.
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
TLDR
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation, and cosine-distance losses.
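
The triple loss mentioned in this summary combines a masked-LM term, a soft-target distillation term against the teacher's logits, and a cosine term aligning student and teacher hidden states. The sketch below shows one way to write such a combination; the equal weighting, temperature, and tensor shapes are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def distilbert_style_loss(student_logits, teacher_logits,
                          student_hidden, teacher_hidden,
                          mlm_labels, T=2.0):
    """Sketch of a masked-LM + distillation + cosine-alignment loss."""
    vocab = student_logits.size(-1)
    hidden = student_hidden.size(-1)

    # (1) masked-LM cross-entropy on hard labels (-100 marks unmasked tokens)
    mlm = F.cross_entropy(student_logits.view(-1, vocab),
                          mlm_labels.view(-1), ignore_index=-100)

    # (2) distillation on temperature-softened distributions
    distill = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                       F.softmax(teacher_logits / T, dim=-1),
                       reduction="batchmean") * (T * T)

    # (3) cosine alignment of student and teacher hidden states
    s = student_hidden.view(-1, hidden)
    t = teacher_hidden.view(-1, hidden)
    target = torch.ones(s.size(0), device=s.device)  # +1 = "make them similar"
    cosine = F.cosine_embedding_loss(s, t, target)

    return mlm + distill + cosine
```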
How to Fine-Tune BERT for Text Classification?
TLDR
A general solution for BERT fine-tuning is provided and new state-of-the-art results on eight widely-studied text classification datasets are obtained.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Universal Language Model Fine-tuning for Text Classification
TLDR
This work proposes Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduces techniques that are key for fine-tuning a language model.
RoBERTa: A Robustly Optimized BERT Pretraining Approach
TLDR
It is found that BERT was significantly undertrained and can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.
Multi-Task Deep Neural Networks for Natural Language Understanding
TLDR
A Multi-Task Deep Neural Network (MT-DNN) for learning representations across multiple natural language understanding (NLU) tasks that allows domain adaptation with substantially fewer in-domain labels than the pre-trained BERT representations.
XLNet: Generalized Autoregressive Pretraining for Language Understanding
TLDR
XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order, and overcomes the limitations of BERT thanks to its autoregressive formulation.