Formulating Few-shot Fine-tuning Towards Language Model Pre-training: A Pilot Study on Named Entity Recognition

Zihan Wang, Kewen Zhao, Zilong Wang, Jingbo Shang
Fine-tuning pre-trained language models has recently become a common practice in building NLP models for various tasks, especially few-shot tasks. We argue that under the few-shot setting, formulating fine-tuning closer to the pre-training objectives should unleash more of the benefits of the pre-trained language models. In this work, we take few-shot named entity recognition (NER) as a pilot study, where existing fine-tuning strategies differ greatly from pre-training. We propose a…
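The method description is cut off above, but as a rough illustration of the general direction (keeping fine-tuning close to the masked-LM pre-training task instead of bolting on a fresh classification head), one could reuse BERT's MLM head to predict a label word at a masked slot next to a candidate span. The prompt wording, label words, and checkpoint below are illustrative assumptions, not the paper's actual formulation.

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")

# Hypothetical label words mapping entity types to single vocabulary tokens.
label_words = {"PER": "person", "LOC": "location", "ORG": "organization"}

sentence = "Barack Obama visited Paris."
span = "Barack Obama"
# Cloze-style prompt: the pre-trained MLM head fills in the label word.
prompt = f"{sentence} {span} is a {tokenizer.mask_token} entity."

inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]

# Score each entity type by the MLM logit of its label word at the masked slot.
scores = {t: logits[tokenizer.convert_tokens_to_ids(w)].item()
          for t, w in label_words.items()}
print(max(scores, key=scores.get))  # expected: PER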



Few-Shot Named Entity Recognition: An Empirical Baseline Study
Three orthogonal schemes to improve model generalization ability in few-shot settings are investigated: meta-learning to construct prototypes for different entity types, task-specific supervised pre-training on noisy web data to extract entity-related representations, and self-training to leverage unlabeled in-domain data.
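A minimal sketch of the first of these schemes, nearest-prototype classification over encoder embeddings; the encoder checkpoint, toy support set, and Euclidean distance are illustrative assumptions rather than the study's actual setup.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoder = AutoModel.from_pretrained("bert-base-cased")

def embed(word: str) -> torch.Tensor:
    # Mean-pool the last hidden state; a stand-in for contextual token
    # embeddings drawn from full support/query sentences.
    with torch.no_grad():
        out = encoder(**tokenizer(word, return_tensors="pt")).last_hidden_state
    return out.mean(dim=1).squeeze(0)

# Toy few-shot support set: a handful of labeled examples per entity type.
support = {"PER": ["Obama", "Einstein"], "LOC": ["Paris", "Tokyo"]}

# One prototype per type = mean embedding of its support examples.
prototypes = {t: torch.stack([embed(w) for w in ws]).mean(dim=0)
              for t, ws in support.items()}

# Label a query token by its nearest prototype (Euclidean distance).
query = embed("Berlin")
pred = min(prototypes, key=lambda t: torch.dist(query, prototypes[t]).item())
print(pred)  # expected: LOC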
Learning from Language Description: Low-shot Named Entity Recognition via Decomposed Framework
A novel NER framework is proposed, namely SpanNER, which learns from natural language supervision and enables the identification of never-seen entity classes without using in-domain labeled data.
Language Models are Few-Shot Learners
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
Language Models are Unsupervised Multitask Learners
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, is introduced, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
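In practice, "just one additional output layer" corresponds to a token-classification head on top of the pre-trained encoder, as in this small sketch with the Transformers library; the label set and checkpoint are placeholders, and the new head is randomly initialized, so predictions are meaningless until fine-tuned.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]  # placeholder tag set
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# Adds a single randomly initialized linear layer over BERT's hidden states.
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels)
)

inputs = tokenizer("Angela Merkel lives in Berlin", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, num_labels)

pred_ids = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0].tolist())
# Untrained head: these tag assignments are random until the model is fine-tuned.
print(list(zip(tokens, [labels[i] for i in pred_ids])))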
Template-Based Named Entity Recognition Using BART
A template-based method for NER, treating NER as a language model ranking problem in a sequence-to-sequence framework, where original sentences and statement templates filled with candidate named entity spans are regarded as the source sequence and the target sequence, respectively.
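A rough sketch of the ranking step with a generic BART checkpoint: each candidate entity type fills a statement template, and the type whose filled template the seq2seq model finds most likely wins. The template wording and label set are assumptions, not necessarily the paper's templates.

import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

sentence = "Jobs founded Apple in California."
span = "Apple"
candidate_types = ["person", "organization", "location"]

def template_score(entity_type: str) -> float:
    # Higher is better: negative seq2seq loss of the filled template
    # given the original sentence as the source sequence.
    target = f"{span} is a {entity_type} entity."
    source = tokenizer(sentence, return_tensors="pt")
    target_ids = tokenizer(target, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(**source, labels=target_ids).loss
    return -loss.item()

print(max(candidate_types, key=template_score))  # expected: organization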
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
BART is presented, a denoising autoencoder for pretraining sequence-to-sequence models, which matches the performance of RoBERTa on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks.
Transformers: State-of-the-Art Natural Language Processing
Transformers is an open-source library that consists of carefully engineered state-of-the-art Transformer architectures under a unified API and a curated collection of pretrained models made by and available for the community.
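For instance, a community-shared NER model can be loaded and run behind the unified pipeline API in a couple of lines; the specific checkpoint name here is just one publicly available example.

from transformers import pipeline

# Token-classification (NER) pipeline over a community checkpoint from the hub.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
print(ner("Hugging Face is based in New York City."))
# e.g. [{'entity_group': 'ORG', 'word': 'Hugging Face', ...},
#       {'entity_group': 'LOC', 'word': 'New York City', ...}]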
Autoregressive Entity Retrieval
Entities are at the center of how we represent and aggregate knowledge. For instance, encyclopedias such as Wikipedia are structured by entities (e.g., one per article). The ability to retrieve such…