Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models

@article{Xia2022PromptingEF,
  title={Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models},
  author={M. Xia and Mikel Artetxe and Jingfei Du and Danqi Chen and Ves Stoyanov},
  journal={ArXiv},
  year={2022},
  volume={abs/2205.15223}
}
Pre-trained masked language models successfully perform few-shot learning by formulating downstream tasks as text infilling. However, as a strong alternative in full-shot settings, discriminative pre-trained models like ELECTRA do not fit into the paradigm. In this work, we adapt prompt-based few-shot learning to ELECTRA and show that it outperforms masked language models in a wide range of tasks. ELECTRA is pre-trained to distinguish if a token is generated or original. We naturally extend…
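To make the idea concrete, the sketch below illustrates the general recipe the abstract describes: place a candidate label word into a template and let ELECTRA's replaced-token-detection (RTD) head judge how "original" that word looks in context. This is a minimal illustration, not the authors' released code; the model checkpoint, template, and label words are assumptions chosen only for the example, and it relies on the HuggingFace transformers API.

```python
# Minimal sketch of RTD-style prompting with ELECTRA (illustrative only).
# Assumes the HuggingFace `transformers` library; template/labels are hypothetical.
import torch
from transformers import ElectraTokenizer, ElectraForPreTraining

tokenizer = ElectraTokenizer.from_pretrained("google/electra-small-discriminator")
model = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")
model.eval()

def score_labels(sentence, template, label_words):
    """Return P(label token is original) for each candidate label word in the template."""
    scores = {}
    for word in label_words:
        text = template.format(sentence=sentence, label=word)
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            # Per-token logits: sigmoid(logit) is the probability the token was replaced.
            logits = model(**inputs).logits
        word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
        ids = inputs["input_ids"][0].tolist()
        # Locate the label word's token span and average P(original) over it.
        for i in range(len(ids) - len(word_ids) + 1):
            if ids[i:i + len(word_ids)] == word_ids:
                p_original = torch.sigmoid(-logits[0, i:i + len(word_ids)]).mean().item()
                scores[word] = p_original
                break
    return scores

# Illustrative zero-shot sentiment example: predict the label whose verbalizer
# word the discriminator finds most "original" in the prompted context.
scores = score_labels(
    sentence="The movie was a complete waste of time.",
    template="{sentence} It was {label}.",
    label_words=["great", "terrible"],
)
print(max(scores, key=scores.get), scores)
```

The design choice mirrors the paper's framing: instead of asking a masked LM to generate a verbalizer token, the discriminator only has to decide whether an inserted token fits the context.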

ELECTRA is a Zero-Shot Learner, Too

Experimental results show that an ELECTRA model based on RTD-prompt learning achieves surprising, state-of-the-art zero-shot performance, and that the pre-trained replaced-token-detection model outperforms pre-trained masked language models in zero-shot learning.

A Universal Discriminator for Zero-Shot Generalization

This work challenges this convention by showing that discriminative approaches perform substantially better than generative ones on a large number of NLP tasks, and jointly trains a generalized UD in combination with generative tasks, which maintains its advantage on discriminative tasks while simultaneously working on generative tasks.

Discriminative Language Model as Semantic Consistency Scorer for Prompt-based Few-Shot Text Classification

This paper proposes a novel prompt-based fine-tuning method (called DLMSCS) for few-shot text classification by utilizing the discriminative language model ELECTRA, which is pre-trained to distinguish original tokens from replaced ones.

Cold-Start Data Selection for Few-shot Language Model Fine-tuning: A Prompt-Based Uncertainty Propagation Approach

This work designs a prompt-based uncertainty propagation approach to estimate the importance of data points and a partition-then-rewrite (PTR) strategy to promote sample diversity when querying for annotations.

YATO: Yet Another deep learning based Text analysis Open toolkit

This work introduces YATO, an open-source deep-learning toolkit for text analysis that focuses on fundamental sequence labeling and sequence classification tasks, can facilitate reproducing and refining state-of-the-art NLP models, and promotes cross-disciplinary applications of NLP techniques.

ArT: All-round Thinker for Unsupervised Commonsense Question Answering

This work proposes All-round Thinker (ArT), a model that fully takes association into account during knowledge generation, shows strong performance, and outperforms previous advanced unsupervised models across all scales of PLM backbones.

References

Showing 1-10 of 32 references

Pre-trained Token-replaced Detection Model as Few-shot Learner

This paper proposes a novel approach to few-shot learning with pre-trained token-replaced detection models like ELECTRA, and demonstrates that this approach outperforms few-shot learners with pre-trained masked language models in both one-sentence and two-sentence learning tasks.

Prompt Tuning for Discriminative Pre-trained Language Models

DPT is presented, the first prompt tuning framework for discriminative PLMs, which reformulates NLP tasks into a discriminative language modeling problem; it achieves significantly higher performance and also avoids the instability problem of tuning large PLMs in both full-set and low-resource settings.

Making Pre-trained Language Models Better Few-shot Learners

The LM-BFF approach makes minimal assumptions on task resources and domain expertise, and hence constitutes a strong task-agnostic method for few-shot learning.

Language Models are Few-Shot Learners

GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

The contextual representations learned by the proposed replaced token detection pre-training task substantially outperform the ones learned by methods such as BERT and XLNet given the same model size, data, and compute.

Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference

This work introduces Pattern-Exploiting Training (PET), a semi-supervised training procedure that reformulates input examples as cloze-style phrases to help language models understand a given task.
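For contrast with the RTD sketch above, the snippet below illustrates the cloze-style reformulation that PET and other masked-LM prompting methods rely on: a [MASK] slot in a task pattern is scored with the MLM head, and verbalizer words map token probabilities to labels. It is a minimal illustration under assumed choices (BERT checkpoint, pattern, and verbalizer are hypothetical), not PET's released implementation.

```python
# Minimal sketch of cloze-style (MLM) prompting, for comparison with RTD prompting.
# Assumes the HuggingFace `transformers` library; pattern/verbalizer are hypothetical.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def cloze_classify(sentence, verbalizer):
    """Score each label by the MLM probability of its verbalizer word at [MASK]."""
    text = f"{sentence} It was {tokenizer.mask_token}."
    inputs = tokenizer(text, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    probs = logits.softmax(dim=-1)
    return {
        label: probs[tokenizer.convert_tokens_to_ids(word)].item()
        for label, word in verbalizer.items()
    }

print(cloze_classify("The movie was a complete waste of time.",
                     {"positive": "great", "negative": "terrible"}))
```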

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Efficient Large Scale Language Modeling with Mixtures of Experts

This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and full-shot fine-tuning.

It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners

This work shows that performance similar to GPT-3 can be obtained with language models that are much “greener” in that their parameter count is several orders of magnitude smaller, and identifies key factors required for successful natural language understanding with small language models.

Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping

This work investigates how the performance of the best-found model varies as a function of the number of fine-tuning trials, and examines two factors influenced by the choice of random seed: weight initialization and training data order.