GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman

For natural language understanding (NLU) technology to be maximally useful, both practically and as a scientific object of study, it must be general: it must be able to process language in a way that is not exclusively tailored to any one specific task or dataset. […] We further provide a hand-crafted diagnostic test suite that enables detailed linguistic analysis of NLU models. We evaluate baselines based on current methods for multi-task and transfer learning and find that they do not immediately…

An Interpretability Evaluation Benchmark for Pre-trained Language Models

A novel evaluation benchmark that provides both English and Chinese annotated data, tests LMs' abilities in multiple dimensions (grammar, semantics, knowledge, reasoning and computation), and contains perturbed instances for each original instance, so that rationale consistency under perturbations can serve as the metric for faithfulness, one perspective on interpretability.

Effects and Mitigation of Out-of-vocabulary in Universal Language Models

The adverse effects of OOV are demonstrated in the context of transfer learning in CJK languages, and a novel approach is then proposed to maximize the utility of a pre-trained model suffering from OOV.

GLGE: A New General Language Generation Evaluation Benchmark

The General Language Generation Evaluation (GLGE), a new multi-task benchmark for evaluating the generalization capabilities of NLG models across eight language generation tasks, is presented and a leaderboard with strong baselines including MASS, BART, and ProphetNet is built.

On the Role of Corpus Ordering in Language Modeling

Empirical results from training transformer language models on an English corpus and evaluating them intrinsically, as well as after fine-tuning across eight tasks from the GLUE benchmark, show consistent gains over conventional vanilla training.

CLUE: A Chinese Language Understanding Evaluation Benchmark

The first large-scale Chinese Language Understanding Evaluation (CLUE) benchmark is introduced, an open-ended, community-driven project that brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension, all on original Chinese text.

Generalizing Natural Language Analysis through Span-relation Representations

This paper provides the simple insight that a great variety of tasks can be represented in a single unified format consisting of labeling spans and relations between spans, so a single task-independent model can be used across different tasks.

SILT: Efficient transformer training for inter-lingual inference

Coarse-to-Fine: Hierarchical Multi-task Learning for Natural Language Understanding

A hierarchical framework with a coarse-to-fine paradigm is proposed, with the bottom level shared across all tasks, the mid-level divided into different groups, and the top level assigned to each individual task. This allows the model to learn basic language properties from all tasks, boost performance on relevant tasks, and reduce the negative impact of irrelevant tasks.

Language Models are Unsupervised Multitask Learners

It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

Knowledge Graph Fusion for Language Model Fine-tuning

Evidence is shown that, given the appropriate task, modest injection of relevant, high-quality knowledge is most performant; this work investigates the benefits of incorporating knowledge into the tuning stages of BERT.



Recurrent Neural Network-Based Sentence Encoder with Gated Attention for Natural Language Inference

This paper describes a model (alpha) that is ranked among the top in the Shared Task, on both the in-domain test set and the cross-domain test set, demonstrating that the model generalizes well to the cross-domain data.

A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks

A joint many-task model is introduced, together with a strategy for successively growing its depth to solve increasingly complex tasks; it uses a simple regularization term to allow optimizing all model weights to improve one task’s loss without exhibiting catastrophic interference in the other tasks.

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

The Multi-Genre Natural Language Inference corpus is introduced, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding, and it is shown to represent a substantially more difficult task than the Stanford NLI corpus.

Sluice networks: Learning what to share between loosely related tasks

Sluice networks are introduced, a general framework for multi-task learning where trainable parameters control the amount of sharing, and it is shown that a) label entropy is predictive of gains in sluice networks, confirming findings for hard parameter sharing, and b) while sluice networks easily fit noise, they are robust across domains in practice.

Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks

This work argues for the usefulness of a set of proxy tasks that evaluate reading comprehension via question answering, and classifies these tasks into skill sets so that researchers can identify (and then rectify) the failings of their systems.

A large annotated corpus for learning natural language inference

The Stanford Natural Language Inference corpus is introduced, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning, which allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.

One billion word benchmark for measuring progress in statistical language modeling

A new benchmark corpus to be used for measuring progress in statistical language modeling, with almost one billion words of training data, is proposed, which is useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques.

Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning

This work presents a simple, effective multi-task learning framework for sentence representations that combines the inductive biases of diverse training objectives in a single model and demonstrates that sharing a single recurrent sentence encoder across weakly related tasks leads to consistent improvements over previous methods.

DisSent: Sentence Representation Learning from Explicit Discourse Relations

It is demonstrated that the automatically curated corpus allows a bidirectional LSTM sentence encoder to yield high quality sentence embeddings and can serve as a supervised fine-tuning dataset for larger models such as BERT.

Reasoning about Entailment with Neural Attention

This paper proposes a neural model that reads two sentences to determine entailment using long short-term memory units and extends this model with a word-by-word neural attention mechanism that encourages reasoning over entailments of pairs of words and phrases, and presents a qualitative analysis of attention weights produced by this model.