Corpus ID: 143424870

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

@inproceedings{Wang2019SuperGLUEAS,
  title={SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems},
  author={Alex Wang and Yada Pruksachatkun and Nikita Nangia and Amanpreet Singh and Julian Michael and Felix Hill and Omer Levy and Samuel R. Bowman},
  booktitle={NeurIPS},
  year={2019}
}
In the last year, new models and methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks. The GLUE benchmark, introduced a little over one year ago, offers a single-number metric that summarizes progress on a diverse set of such tasks, but performance on the benchmark has recently surpassed the level of non-expert humans, suggesting limited headroom for further research. In this paper we present SuperGLUE, a new…
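
The abstract's central object is a single-number metric summarizing performance across a diverse set of tasks. As a rough illustration of how such a score is typically computed (an unweighted macro-average of per-task metrics, averaging within a task first when it reports more than one metric), here is a short Python sketch; the `benchmark_score` helper and the example task scores are placeholders, not the official SuperGLUE scoring code or real leaderboard numbers.

```python
from statistics import mean

def benchmark_score(per_task_metrics):
    """Macro-average: average each task's metrics, then average across tasks.

    `per_task_metrics` maps a task name to one or more metric values
    (e.g. accuracy, F1) on a 0-100 scale. All names and numbers here are
    illustrative placeholders.
    """
    task_scores = [mean(scores) for scores in per_task_metrics.values()]
    return mean(task_scores)

# Hypothetical per-task results (not real leaderboard numbers).
example = {
    "BoolQ": [72.0],
    "CB":    [80.0, 75.0],  # accuracy and F1 averaged within the task
    "COPA":  [68.0],
    "RTE":   [70.0],
}
print(round(benchmark_score(example), 1))  # -> 71.9
```
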

Citations

GLGE: A New General Language Generation Evaluation Benchmark
TLDR: The General Language Generation Evaluation (GLGE), a new multi-task benchmark for evaluating the generalization capabilities of NLG models across eight language generation tasks, is presented, and a leaderboard with strong baselines including MASS, BART, and ProphetNet is built.
MOROCCO: Model Resource Comparison Framework
TLDR: This work presents MOROCCO, a framework for comparing language models compatible with the jiant environment, which supports over 50 NLU tasks including the SuperGLUE benchmark and multiple probing suites, and demonstrates its applicability to two GLUE-like suites in different languages.
The Sensitivity of Language Models and Humans to Winograd Schema Perturbations
TLDR: Results highlight interesting differences between humans and language models: language models are more sensitive to number or gender alternations and synonym replacements than humans, while humans are more stable and consistent in their predictions, maintain much higher absolute performance, and perform better on non-associative instances than associative ones.
WaLDORf: Wasteless Language-model Distillation On Reading-comprehension
TLDR: A novel set of techniques is proposed that together produce a task-specific hybrid convolutional and transformer model, WaLDORf, which achieves state-of-the-art inference speed while remaining more accurate than previous distilled models.
Structural analysis of an all-purpose question answering model
TLDR: It is observed that attention heads specialize in a particular task and that some heads are more conducive to learning than others, in both the multi-task and single-task settings.
Multi-task learning for natural language processing in the 2020s: where are we going?
TLDR: This paper provides a comprehensive survey of recent MTL contributions to the field of natural language processing and a forum to focus efforts on the hardest unsolved problems of the next decade.
RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark
TLDR: This paper introduces Russian SuperGLUE, an advanced Russian general language understanding evaluation benchmark, presents the first results of comparing multilingual models on the translated diagnostic test set, and offers first steps toward further expanding or assessing state-of-the-art models independently of language.
DIET: Lightweight Language Understanding for Dialogue Systems
Large-scale pre-trained language models have shown impressive results on language understanding benchmarks like GLUE and SuperGLUE, improving considerably over other pre-training methods like…
Muppet: Massive Multi-task Representations with Pre-Finetuning
TLDR: It is shown that pre-finetuning consistently improves performance for pretrained discriminators and generation models on a wide range of tasks, while also significantly improving sample efficiency during fine-tuning.
How Can We Accelerate Progress Towards Human-like Linguistic Generalization?
TLDR: This position paper describes and critiques the Pretraining-Agnostic Identically Distributed (PAID) evaluation paradigm, and advocates for supplementing or replacing PAID with paradigms that reward architectures that generalize as quickly and robustly as humans.

References

Showing 1-10 of 84 references
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
TLDR: A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models' understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge transfer across tasks.
BAM! Born-Again Multi-Task Networks for Natural Language Understanding
TLDR: This work proposes knowledge distillation in which single-task models teach a multi-task model, and enhances this training with teacher annealing, a novel method that gradually transitions the model from distillation to supervised learning, helping the multi-task model surpass its single-task teachers.
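
The teacher-annealing idea summarized above lends itself to a short sketch: the training loss interpolates between matching a single-task teacher's predictions and fitting the gold labels, with the mixing weight shifting toward the gold labels over training. The PyTorch code below is an illustrative reconstruction under assumed tensor shapes and a linear schedule, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def annealed_distillation_loss(student_logits, teacher_logits, gold_labels,
                               step, total_steps):
    """Interpolate between distillation and supervised learning.

    `lam` moves from 0 (pure distillation) to 1 (pure supervised) linearly,
    an assumed schedule; logits are (batch, num_classes), gold labels are
    integer class indices of shape (batch,).
    """
    lam = min(step / total_steps, 1.0)
    supervised = F.cross_entropy(student_logits, gold_labels)
    distill = F.kl_div(F.log_softmax(student_logits, dim=-1),
                       F.softmax(teacher_logits, dim=-1),
                       reduction="batchmean")
    return lam * supervised + (1.0 - lam) * distill

# Toy usage with random tensors (placeholder data only).
s = torch.randn(8, 3, requires_grad=True)
t = torch.randn(8, 3)
y = torch.randint(0, 3, (8,))
loss = annealed_distillation_loss(s, t, y, step=100, total_steps=1000)
loss.backward()
```
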
Human vs. Muppet: A Conservative Estimate of Human Performance on the GLUE Benchmark
TLDR: Using the BERT model in limited-data regimes, it is concluded that low-resource sentence classification remains a challenge for modern neural network approaches to text understanding.
Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks
TLDR: Supplementary training on data-rich supervised tasks, such as natural language inference, yields additional performance improvements on the GLUE benchmark, along with reduced variance across random restarts in this setting.
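
The supplementary-training recipe summarized above is essentially a two-stage fine-tuning pipeline: first fine-tune a pretrained encoder on a data-rich intermediate task, then fine-tune the same encoder on the target task with a fresh output head. The sketch below uses a toy encoder and random tensors in place of a pretrained transformer and real task data; all module names, sizes, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Minimal stand-in for a pretrained sentence encoder; a real STILTs-style
# setup would start from a pretrained transformer instead.
encoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU())

def finetune(encoder, head, batches, epochs=1):
    """Fine-tune the shared encoder plus a task head on (features, labels) batches."""
    params = list(encoder.parameters()) + list(head.parameters())
    opt = torch.optim.AdamW(params, lr=2e-5)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in batches:
            opt.zero_grad()
            loss = loss_fn(head(encoder(x)), y)
            loss.backward()
            opt.step()

# Stage 1: supplementary training on a data-rich intermediate task (e.g. NLI).
nli_head = nn.Linear(256, 3)
nli_batches = [(torch.randn(16, 128), torch.randint(0, 3, (16,))) for _ in range(4)]
finetune(encoder, nli_head, nli_batches)

# Stage 2: fine-tune the same encoder on the (smaller) target task with a
# fresh head; the intermediate head is discarded.
target_head = nn.Linear(256, 2)
target_batches = [(torch.randn(16, 128), torch.randint(0, 2, (16,))) for _ in range(2)]
finetune(encoder, target_head, target_batches)
```
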
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR: A new language representation model, BERT, is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, and can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Multi-Task Deep Neural Networks for Natural Language Understanding
TLDR: A Multi-Task Deep Neural Network (MT-DNN) for learning representations across multiple natural language understanding (NLU) tasks, which allows domain adaptation with substantially fewer in-domain labels than the pre-trained BERT representations.
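
In contrast to the sequential STILTs recipe sketched earlier, a multi-task setup like the one summarized above trains a shared encoder on several tasks simultaneously, typically by sampling a task per step and routing the batch through that task's head. The snippet below is a generic PyTorch illustration of that pattern, not the MT-DNN code; the task names, dimensions, and sampling scheme are assumptions.

```python
import random
import torch
import torch.nn as nn

# Shared encoder with one classification head per task; both the encoder and
# the data are toy stand-ins for a pretrained model and real NLU datasets.
encoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU())
heads = nn.ModuleDict({"nli": nn.Linear(256, 3), "sst": nn.Linear(256, 2)})
num_classes = {"nli": 3, "sst": 2}

opt = torch.optim.AdamW(list(encoder.parameters()) + list(heads.parameters()), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

for step in range(20):
    task = random.choice(list(heads))             # sample a task each step
    x = torch.randn(16, 128)                      # placeholder batch
    y = torch.randint(0, num_classes[task], (16,))
    loss = loss_fn(heads[task](encoder(x)), y)    # route through that task's head
    opt.zero_grad()
    loss.backward()
    opt.step()
```
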
Linguistic Knowledge and Transferability of Contextual Representations
TLDR: It is found that linear models trained on top of frozen contextual representations are competitive with state-of-the-art task-specific models in many cases, but fail on tasks requiring fine-grained linguistic knowledge.
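
The probing setup described above (a linear model over frozen contextual representations) is simple to sketch: the encoder's weights are never updated, and only a linear classifier is trained on its outputs. The snippet below uses a dummy encoder and random tensors as placeholders for a pretrained model and real probing data.

```python
import torch
import torch.nn as nn

# Frozen stand-in for a pretrained contextual encoder; in the probing setup
# its weights are never updated.
encoder = nn.Sequential(nn.Linear(128, 256), nn.Tanh())
for p in encoder.parameters():
    p.requires_grad = False

probe = nn.Linear(256, 2)            # the only trainable component
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Placeholder data standing in for contextual representations and labels.
x = torch.randn(64, 128)
y = torch.randint(0, 2, (64,))

for _ in range(10):
    with torch.no_grad():
        feats = encoder(x)           # frozen features
    loss = loss_fn(probe(feats), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```
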
Multilingual Constituency Parsing with Self-Attention and Pre-Training
TLDR: It is shown that constituency parsing benefits from unsupervised pre-training across a variety of languages and a range of pre-training conditions; the idea of joint fine-tuning is also explored and shown to give low-resource languages a way to benefit from the larger datasets of other languages.
AllenNLP: A Deep Semantic Natural Language Processing Platform
TLDR: AllenNLP is designed to support researchers who want to build novel language understanding models quickly and easily, providing a flexible data API that handles intelligent batching and padding, and a modular, extensible experiment framework that makes doing good science easy.
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
TLDR: It is found that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that it, surprisingly, continues to be very beneficial even when starting from massive pre-trained language models such as BERT.