Corpus ID: 235732082

Scarecrow: A Framework for Scrutinizing Machine Text

@article{Dou2021ScarecrowAF,
  title={Scarecrow: A Framework for Scrutinizing Machine Text},
  author={Yao Dou and Maxwell Forbes and Rik Koncel-Kedziorski and Noah A. Smith and Yejin Choi},
  journal={ArXiv},
  year={2021},
  volume={abs/2107.01294}
}
Modern neural text generation systems can produce remarkably fluent and grammatical texts. While earlier language models suffered from repetition and syntactic errors, the errors made by contemporary models are often semantic, narrative, or discourse failures. To facilitate research on these complex error types, we introduce a new structured, crowdsourced error annotation schema called Scarecrow. The error categories used in Scarecrow, such as redundancy, commonsense errors, and incoherence…
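To make the flavor of span-level error annotation concrete, here is a minimal Python sketch of how one such annotation might be represented; the field names and the three-category list are illustrative assumptions, not the paper's exact schema.

```python
from dataclasses import dataclass

# Illustrative categories only; the paper's full error taxonomy is larger.
ERROR_CATEGORIES = {"redundant", "commonsense", "incoherent"}

@dataclass
class SpanAnnotation:
    """One crowdworker-marked error span in a generated text (hypothetical fields)."""
    start: int        # character offset where the error span begins
    end: int          # character offset where it ends (exclusive)
    category: str     # one of ERROR_CATEGORIES
    explanation: str  # free-text rationale from the annotator

    def __post_init__(self):
        if self.category not in ERROR_CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")

text = "The sun rose in the west, as it does every morning."
ann = SpanAnnotation(start=16, end=24, category="commonsense",
                     explanation="The sun rises in the east.")
print(text[ann.start:ann.end], "->", ann.category)   # "the west -> commonsense"
```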

Unraveling the Mystery of Artifacts in Machine Generated Text

TLDR
This work proposes to systematically study the forms and scopes of artifacts by corrupting text, replacing it with linguistic or statistical features, and applying the interpretable method of Integrated Gradients.
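As a concrete illustration of the attribution technique named above, here is a generic sketch of Integrated Gradients for a differentiable scorer; the toy linear model, zero baseline, and step count are assumptions for illustration, not the paper's setup.

```python
import torch

def integrated_gradients(model, x, baseline, steps=50):
    """Riemann-sum approximation of Integrated Gradients along the straight
    path from `baseline` to `x`."""
    total = torch.zeros_like(x)
    for i in range(1, steps + 1):
        point = (baseline + (i / steps) * (x - baseline)).detach().requires_grad_(True)
        model(point).sum().backward()
        total += point.grad
    return (x - baseline) * total / steps   # per-dimension attributions

# Toy differentiable scorer over a feature vector (a stand-in for a text
# classifier applied to embeddings); purely illustrative.
torch.manual_seed(0)
W = torch.randn(8)
model = lambda v: (v * W).sum(dim=-1, keepdim=True)

x, baseline = torch.randn(8), torch.zeros(8)
attr = integrated_gradients(model, x, baseline)
print(attr)                                      # feature attributions
print(attr.sum(), (model(x) - model(baseline)))  # completeness: sums match the score change
```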

How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN

TLDR
RAVEN, a suite of analyses for assessing the novelty of generated text, focusing on sequential structure (n-grams) and syntactic structure, is introduced, showing that GPT-2's novel text is usually well-formed morphologically and syntactically but has reasonably frequent semantic issues.
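The n-gram novelty idea can be sketched in a few lines; the whitespace tokenization and tiny corpora below are simplifying assumptions, and this is not the RAVEN toolkit itself.

```python
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novelty(generated, training, n):
    """Fraction of the generated text's n-grams that never occur in the training text."""
    gen = ngrams(generated.split(), n)
    seen = ngrams(training.split(), n)
    return len(gen - seen) / len(gen) if gen else 0.0

training_text = "the cat sat on the mat while the dog slept on the rug"
generated_text = "the cat slept on the rug while the dog sat on the mat"

for n in (2, 3, 4):
    # Higher-order n-grams tend to be more novel than bigrams.
    print(n, round(novelty(generated_text, training_text, n), 2))
```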

Unsupervised and Distributional Detection of Machine-Generated Text

TLDR
This paper proposes a method for detecting machine-generated documents by leveraging repeated higher-order n-grams, which appear more often in machine-generated text than in human-written text.
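A toy sketch of the repetition signal described above: count how often higher-order n-grams recur within a single document. The choice of n and the example texts are assumptions; the paper's full unsupervised, distributional detector is more involved.

```python
from collections import Counter

def repeated_ngram_rate(text, n=4):
    """Proportion of a document's n-grams that occur more than once in that document."""
    tokens = text.split()
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total if total else 0.0

human = ("she walked to the market, bought fresh bread, "
         "and chatted with the baker about the weather")
machine = "the model is a model that is a model that is a model of the model"

print(repeated_ngram_rate(human))    # ~0.0 for varied human prose
print(repeated_ngram_rate(machine))  # noticeably higher for degenerate, repetitive text
```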

WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation

TLDR
This work introduces a novel approach for dataset creation based on worker and AI collaboration, which brings together the generative strength of language models and the evaluative power of humans.

Cut the CARP: Fishing for zero-shot story evaluation

TLDR
A strong correlation between human evaluations of stories and those of CARP is shown, and the model's outputs correlate more strongly with the corresponding human input than those of language-model-based methods that rely on finetuning or prompt engineering.

Textinator: an Internationalized Tool for Annotation and Human Evaluation in Natural Language Processing and Generation

TLDR
An internationalized annotation and human evaluation bundle, called Textinator, is released along with documentation and video tutorials, and a thorough systematic comparison of Textinator to previously published annotation tools along 9 different axes is presented.

PLANET: Dynamic Content Planning in Autoregressive Transformers for Long-form Text Generation

TLDR
PLANET, a novel generation framework, is proposed that leverages an autoregressive self-attention mechanism to conduct content planning and surface realization dynamically, and a new coherence-based contrastive learning objective is introduced to further improve the coherence of the output.

Event Transition Planning for Open-ended Text Generation

TLDR
A novel two-stage method is proposed that explicitly arranges the ensuing events in open-ended text generation and effectively improves the quality of the generated text, especially its coherence and diversity.

Do Language Models Plagiarize?

TLDR
The findings support that language models, especially GPT-2, reuse particular pieces of texts from the training corpus with or without obfuscation, and implies that future research on neural language models should take precautions to avoid models plagiarizing their training datasets.

ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection

TLDR
ToxiGen, a new large-scale, machine-generated dataset of 274k toxic and benign statements about 13 minority groups, is created, and it is demonstrated that finetuning a toxicity classifier on this data substantially improves its performance on human-written data.

References


The Curious Case of Neural Text Degeneration

TLDR
By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better matches the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.
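A minimal NumPy sketch of nucleus (top-p) sampling over a toy next-token distribution; the vocabulary, probabilities, and p value are illustrative assumptions.

```python
import numpy as np

def nucleus_sample(probs, p=0.9, rng=None):
    """Sample from the smallest set of tokens whose cumulative probability exceeds p."""
    rng = rng if rng is not None else np.random.default_rng()
    order = np.argsort(probs)[::-1]          # token indices, most probable first
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1   # nucleus size
    nucleus = order[:cutoff]
    renorm = probs[nucleus] / probs[nucleus].sum()
    return rng.choice(nucleus, p=renorm)

vocab = np.array(["the", "a", "cat", "dog", "zyzzyva"])
probs = np.array([0.42, 0.30, 0.15, 0.10, 0.03])   # toy next-token distribution
rng = np.random.default_rng(0)
print([vocab[nucleus_sample(probs, p=0.9, rng=rng)] for _ in range(5)])
# The unreliable tail ("zyzzyva") is truncated and never sampled.
```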

RoFT: A Tool for Evaluating Human Detection of Machine-Generated Text

TLDR
This work introduces a novel evaluation task based on detecting the boundary at which a text passage that starts off human-written transitions to being machine-generated.

UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation

TLDR
UNION is a learnable unreferenced metric for evaluating open-ended story generation that measures the quality of a generated story without any reference; it correlates better with human judgments and is more generalizable than existing state-of-the-art metrics.

Unifying Human and Statistical Evaluation for Natural Language Generation

TLDR
This paper proposes HUSE, a unified framework that evaluates both diversity and quality based on the optimal error rate of predicting whether a sentence is human- or machine-generated; it is efficiently estimated by combining human and statistical evaluation.
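A crude one-dimensional stand-in for the idea of measuring the optimal human-vs-machine classification error; real HUSE combines human judgments with model probabilities, so the synthetic scores and the threshold classifier below are simplifying assumptions.

```python
import numpy as np

def best_threshold_error(human_scores, machine_scores):
    """Lowest error any threshold rule achieves at telling human from machine samples;
    values near 0.5 mean the two sets of scores are essentially indistinguishable."""
    scores = np.concatenate([human_scores, machine_scores])
    labels = np.concatenate([np.ones(len(human_scores)), np.zeros(len(machine_scores))])
    best = 0.5  # chance level for balanced classes
    for t in np.unique(scores):
        for direction in (1, -1):
            pred = (direction * scores >= direction * t).astype(float)
            best = min(best, float(np.mean(pred != labels)))
    return best

rng = np.random.default_rng(0)
human = rng.normal(1.0, 1.0, 200)     # toy per-sentence scores for human text
machine = rng.normal(0.0, 1.0, 200)   # toy scores for machine text
print(best_threshold_error(human, machine))
```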

GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation

TLDR
This work introduces GENIE, an extensible human evaluation leaderboard, which brings the ease of leaderboards to text generation tasks and provides formal granular evaluation metrics and identifies areas for future research.

MAUVE: Human-Machine Divergence Curves for Evaluating Open-Ended Text Generation

TLDR
MAUVE is a metric for open-ended text generation, which directly compares the distribution of machine-generated text to that of human language, and shows that evaluation under MAUVE reflects the more natural behavior with respect to model size, compared to prior metrics.
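A toy illustration of comparing two discrete distributions through mixtures and KL divergences, loosely following the divergence-curve idea; the real metric operates on quantized neural text embeddings and summarizes an area under the curve, so the histograms, scaling constant, and mixture grid here are assumptions.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions (natural log)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def divergence_curve(p, q, c=1.0, lambdas=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """For each mixture r = lam*p + (1-lam)*q, report the pair
    (exp(-c*KL(q||r)), exp(-c*KL(p||r))); both coordinates near 1.0
    mean the two distributions are close."""
    points = []
    for lam in lambdas:
        r = lam * p + (1 - lam) * q
        points.append((np.exp(-c * kl(q, r)), np.exp(-c * kl(p, r))))
    return points

human   = np.array([0.4, 0.3, 0.2, 0.1])   # toy histogram over quantized text features
machine = np.array([0.1, 0.2, 0.3, 0.4])
for x, y in divergence_curve(human, machine):
    print(round(float(x), 3), round(float(y), 3))
```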

Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm

TLDR
It is suggested that the function of few-shot examples in these cases is better described as locating an already learned task rather than meta-learning, which motivates rethinking the role of prompts in controlling and evaluating powerful language models.

MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text

TLDR
MCTest is presented, a freely available set of stories and associated questions intended for research on the machine comprehension of text that requires machines to answer multiple-choice reading comprehension questions about fictional stories, directly tackling the high-level goal of open-domain machine comprehension.

TuringAdvice: A Generative and Dynamic Evaluation of Language Use

TLDR
Empirical results show that today’s models struggle at TuringAdvice, even multibillion parameter models finetuned on 600k in-domain training examples, and this low performance reveals language understanding errors that are hard to spot outside of a generative setting.

Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

TLDR
This work introduces Stanza, an open-source Python natural language processing toolkit supporting 66 human languages that features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition.
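A short usage sketch of the Stanza pipeline described above; the processor list and example sentence are illustrative, and the English models must be downloaded before first use.

```python
import stanza

# One-time download of the English models (assumes network access).
stanza.download("en")

# Build a neural pipeline; the processor list can be trimmed to what you need.
nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse,ner")

doc = nlp("Scarecrow annotations were collected from crowdworkers in the United States.")

for sentence in doc.sentences:
    for word in sentence.words:
        # Surface form, lemma, universal POS tag, and dependency relation to the head.
        print(word.text, word.lemma, word.upos, word.deprel)

# Named entities found by the NER processor.
print([(ent.text, ent.type) for ent in doc.ents])
```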