Corpus ID: 246016339

WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation

@inproceedings{Liu2022WANLIWA,
  title={WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation},
  author={Alisa Liu and Swabha Swayamdipta and Noah A. Smith and Yejin Choi},
  year={2022}
}
A recurring challenge of crowdsourcing NLP datasets at scale is that human writers often rely on repetitive patterns when crafting examples, leading to a lack of linguistic diversity. We introduce a novel paradigm for dataset creation based on human and machine collaboration, which brings together the generative strength of language models and the evaluative strength of humans. Starting with an existing dataset, MultiNLI, our approach uses dataset cartography to automatically identify… 
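As a concrete illustration of the cartography step described above, the sketch below computes the data-map statistics (confidence and variability of the gold-label probability across training epochs) and ranks the most ambiguous examples as candidate seeds; the array shapes, ranking rule, and helper names are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the dataset cartography selection step (illustrative only).
# Assumes we logged, for each training example, the model's probability of the
# gold label at the end of every training epoch.
import numpy as np

def data_map_statistics(gold_probs: np.ndarray):
    """gold_probs: shape (num_examples, num_epochs).
    Returns per-example confidence (mean prob) and variability (std dev)."""
    confidence = gold_probs.mean(axis=1)
    variability = gold_probs.std(axis=1)
    return confidence, variability

def select_ambiguous_seeds(gold_probs: np.ndarray, k: int):
    """Pick the k most ambiguous examples: high variability, with ties
    broken toward low confidence (the 'ambiguous' region of the data map)."""
    confidence, variability = data_map_statistics(gold_probs)
    # lexsort uses the last key as primary: sort by variability descending,
    # then by confidence ascending.
    order = np.lexsort((confidence, -variability))
    return order[:k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_probs = rng.uniform(size=(1000, 6))   # 1000 examples, 6 epochs
    seed_ids = select_ambiguous_seeds(fake_probs, k=10)
    print("candidate seed example indices:", seed_ids)
```

In WANLI's pipeline, examples like these seed GPT-3 generation of new premise-hypothesis pairs, which are then filtered and handed to crowdworkers for review, exercising the evaluative strength of humans noted in the abstract.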
FLUTE: Figurative Language Understanding and Textual Explanations
TLDR
FLUTE is released, a dataset of 8,000 figurative NLI instances with explanations, spanning three categories: Sarcasm, Simile, and Metaphor, and it is shown how utilizing GPT-3 in conjunction with human experts can aid in scaling up the creation of datasets even for such complex linguistic phenomena as figurative language.
Reframing Human-AI Collaboration for Generating Free-Text Explanations
TLDR
A pipeline that combines GPT-3 with a supervised filter that incorporates binary acceptability judgments from humans in the loop is created, and it is demonstrated that acceptability is partially correlated with various fine-grained attributes of explanations.
InPars: Data Augmentation for Information Retrieval using Large Language Models
TLDR
This work harnesses the few-shot capabilities of large pretrained language models as synthetic data generators for IR tasks and shows that models fine-tuned solely on the authors' unsupervised dataset outperform strong baselines such as BM25 as well as recently proposed self-supervised dense retrieval methods.
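For intuition about this kind of few-shot synthetic data generation (a sketch in the spirit of InPars, not the paper's exact prompt or pipeline), one can assemble a prompt from a few (document, relevant question) pairs and ask a language model to propose a question for each unlabeled document; the seed pairs and document below are made up for illustration.

```python
# Illustrative few-shot prompt construction for generating synthetic IR
# training queries from documents (a sketch, not the paper's exact prompt).
FEW_SHOT_EXAMPLES = [  # hypothetical seed (document, question) pairs
    ("The Amazon rainforest produces roughly 20% of Earth's oxygen.",
     "how much oxygen does the amazon rainforest produce"),
    ("Python 3.12 removed the distutils module from the standard library.",
     "which python version removed distutils"),
]

def build_prompt(document: str) -> str:
    """Assemble a few-shot prompt asking an LM to write a relevant question."""
    parts = []
    for doc, question in FEW_SHOT_EXAMPLES:
        parts.append(f"Document: {doc}\nRelevant question: {question}\n")
    parts.append(f"Document: {document}\nRelevant question:")
    return "\n".join(parts)

if __name__ == "__main__":
    unlabeled_doc = "BM25 is a bag-of-words ranking function based on term frequency."
    print(build_prompt(unlabeled_doc))
    # In practice the prompt is sent to a large language model; the generated
    # question is paired with the document as a positive example for
    # fine-tuning a retriever or re-ranker.
```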
Active Programming by Example with a Natural Language Prior
TLDR
APEL, a new framework that enables non-programmers to indirectly annotate natural language utterances with executable meaning representations such as SQL programs, is introduced; to reduce the effort required from annotators, it synthesizes simple input databases that nonetheless have high information gain.
ZeroGen: Efficient Zero-shot Learning via Dataset Generation
TLDR
It is argued that ZEROGEN can also provide useful insights from the perspective of data-free model-agnostic knowledge distillation and unreferenced text generation evaluation.
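A minimal sketch of this style of zero-shot dataset generation is shown below, assuming a small open model (gpt2) as a stand-in for the much larger PLM used in the paper and a toy sentiment task; the prompts and hyperparameters are illustrative, not ZEROGEN's.

```python
# Minimal sketch of label-conditioned zero-shot dataset generation
# (ZeroGen-style in spirit; prompts, model, and task are illustrative).
from transformers import pipeline

LABEL_PROMPTS = {  # hypothetical prompts for a toy sentiment task
    "positive": "Write a positive movie review:",
    "negative": "Write a negative movie review:",
}

def generate_synthetic_dataset(per_label: int = 4):
    """Generate (text, label) pairs by prompting a PLM with label descriptions."""
    generator = pipeline("text-generation", model="gpt2")
    dataset = []
    for label, prompt in LABEL_PROMPTS.items():
        outputs = generator(
            prompt,
            max_new_tokens=40,
            num_return_sequences=per_label,
            do_sample=True,
            pad_token_id=50256,  # GPT-2 eos id, silences padding warnings
        )
        for out in outputs:
            text = out["generated_text"][len(prompt):].strip()
            dataset.append({"text": text, "label": label})
    return dataset

if __name__ == "__main__":
    synthetic = generate_synthetic_dataset()
    for example in synthetic[:4]:
        print(example["label"], "->", example["text"][:80])
    # A small task model (e.g., a DistilBERT classifier) would then be trained
    # on this synthetic data instead of any human-labeled corpus.
```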
Uncertainty Estimation for Language Reward Models
TLDR
It is found that in this setting ensemble active learning does not outperform random sampling, and current pre-training methods will need to be modified to support uncertainty estimation, e.g. by training multiple language models.
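To make the ensemble-based acquisition concrete, the toy sketch below scores candidates with several reward models and routes the highest-disagreement candidates to human labelers; the random score matrix stands in for real reward models and is purely illustrative.

```python
# Toy sketch of ensemble-disagreement acquisition for reward modeling
# (illustrative bookkeeping only; the "reward models" here are random stubs).
import numpy as np

def ensemble_uncertainty(scores: np.ndarray) -> np.ndarray:
    """scores: shape (num_models, num_candidates). Returns per-candidate
    disagreement, measured as the standard deviation across ensemble members."""
    return scores.std(axis=0)

def select_for_labeling(scores: np.ndarray, budget: int) -> np.ndarray:
    """Pick the candidates the ensemble disagrees on most (active learning);
    the paper finds this does not beat random sampling in their setting."""
    disagreement = ensemble_uncertainty(scores)
    return np.argsort(-disagreement)[:budget]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_scores = rng.normal(size=(5, 200))   # 5 ensemble members, 200 candidates
    picked = select_for_labeling(fake_scores, budget=8)
    print("candidates routed to human labelers:", picked)
```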
Teaching language models to support answers with verified quotes
TLDR
This work uses reinforcement learning from human preferences to train “open-book” QA models that generate answers whilst also citing specific evidence for their claims, which aids in the appraisal of correctness.
Can Foundation Models Wrangle Your Data?
TLDR
It is found that large FMs generalize and achieve SoTA performance on data cleaning and integration tasks, even though they are not trained for these data tasks.
Self-critiquing models for assisting human evaluators
TLDR
This work fine-tunes large language models to write natural language critiques (natural language critical comments) using behavioral cloning, and finds, on both topic-based summarization and synthetic tasks, that even large models may still have relevant knowledge they cannot or do not articulate as critiques.
ZeroGen+: Self-Guided High-Quality Data Generation in Efficient Zero-Shot Learning
TLDR
A noise-robust bi-level re-weighting framework which is able to learn the per-sample weights measuring the data quality without requiring any gold data is proposed.
...

References

Showing 1-10 of 79 references
Asking Crowdworkers to Write Entailment Examples: The Best of Bad Options
TLDR
This work investigates two alternative protocols which automatically create candidate (premise, hypothesis) pairs for annotators to label and concludes that crowdworker writing is still the best known option for entailment data.
New Protocols and Negative Results for Textual Entailment Data Collection
TLDR
Four alternative protocols are proposed, each aimed at improving either the ease with which annotators can produce sound training examples or the quality and diversity of those examples, and it is observed that all four new protocols reduce previously observed issues with annotation artifacts.
Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets
TLDR
It is shown that model performance improves when training with annotator identifiers as features, that models are able to recognize the most productive annotators, and that models often do not generalize well to examples from annotators who did not contribute to the training set.
Annotation Artifacts in Natural Language Inference Data
TLDR
It is shown that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
TLDR
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.
SynthBio: A Case Study in Human-AI Collaborative Curation of Text Datasets
TLDR
This work introduces a novel method for efficient dataset curation: a large language model is used to provide seed generations to human raters, thereby changing dataset authoring from a writing task to an editing task.
What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks?
TLDR
It is found that asking workers to write explanations for their examples is an ineffective stand-alone strategy for boosting NLU example difficulty and that training crowdworkers, and then using an iterative process of collecting data, sending feedback, and qualifying workers based on expert judgments is an effective means of collecting challenging data.
Scarecrow: A Framework for Scrutinizing Machine Text
TLDR
Humans have more difficulty spotting errors in higher-quality text; accounting for this difference dramatically increases the gap between model-authored and human-authored text.
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
TLDR
It is found that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that it, surprisingly, continues to be very beneficial even when starting from massive pre-trained language models such as BERT.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
TLDR
This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
...