STORIUM: A Dataset and Evaluation Platform for Machine-in-the-Loop Story Generation

Nader Akoury, Shufan Wang, Josh Whiting, Stephen Hood, Nanyun Peng, Mohit Iyyer
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
Systems for story generation are asked to produce plausible and enjoyable stories given an input context. This task is underspecified, as a vast number of diverse stories can originate from a single input. The large output space makes it difficult to build and evaluate story generation models, as (1) existing datasets lack rich enough contexts to meaningfully guide models, and (2) existing evaluations (both crowdsourced and automatic) are unreliable for assessing long-form creative text. To… 


TaleBrush: Sketching Stories with Generative Pretrained Language Models

TaleBrush is introduced, a generative story ideation tool that uses line-sketching interactions with a GPT-based language model for control and sensemaking of a protagonist's fortune in co-created stories, along with a reflection on how sketching interactions can facilitate the iterative human-AI co-creation process.

TVRecap: A Dataset for Generating Stories with Character Descriptions

TVRECAP is introduced, a story generation dataset that requires generating detailed TV show episode recaps from a brief summary and a set of documents describing the characters involved; the best-performing model uses oracle content selectors for character descriptions.

LOT: A Story-Centric Benchmark for Evaluating Chinese Long Text Understanding and Generation

A story-centric benchmark named LOT is proposed for evaluating Chinese long text modeling, which aggregates two understanding tasks and two generation tasks and shows that LongLM outperforms similar-sized pretraining models substantially on both the understanding and generation tasks in LOT.

GENIE: Toward Reproducible and Standardized Human Evaluation for Text Generation

This work considers design choices for the annotation interface used to elicit human judgments and their impact on reproducibility, and develops an automated mechanism for maintaining annotator quality via a probabilistic model that detects and excludes noisy annotators.

WikiTableT: A Large-Scale Data-to-Text Dataset for Generating Wikipedia Article Sections

Qualitative analysis shows that the best approaches can generate fluent, high-quality texts but struggle with coherence and factuality, demonstrating the potential for the WIKITABLET dataset to inspire future work on long-form generation.

IGA: An Intent-Guided Authoring Assistant

An interactive writing assistant is built that generates and rephrases text according to fine-grained author specifications, fine-tuning a language model on a dataset heuristically labeled with author intent.

A Multi-Modal Story Generation Framework with AI-Driven Storyline Guidance

This work proposes a novel multi-modal story generation framework that includes automated storyline decision-making capabilities and demonstrates that the model outperforms the previous approach, suggesting the effectiveness of the storyline guidance model in making proper plans.

FairyTailor: A Multimodal Generative Framework for Storytelling

FairyTailor is the first dynamic tool for multimodal story generation that allows interactive co-formation of both texts and images, letting users give feedback on co-created stories and share their results.

Wordcraft: a Human-AI Collaborative Editor for Story Writing

Wordcraft, an AI-assisted editor for story writing in which a writer and a dialog system collaborate to write a story, is proposed; it provides a sandbox for writers to probe the boundaries of transformer-based language models and paves the way for future human-in-the-loop training pipelines and novel evaluation methods.

Hurdles to Progress in Long-form Question Answering

The task formulation raises fundamental challenges regarding evaluation and dataset creation that currently preclude meaningful modeling progress, and a new system is designed that relies on sparse attention and contrastive retriever learning to achieve state-of-the-art performance on the ELI5 LFQA dataset.

A Knowledge-Enhanced Pretraining Model for Commonsense Story Generation

A knowledge-enhanced pretraining model is proposed that utilizes commonsense knowledge from external knowledge bases; it generates more reasonable stories than state-of-the-art baselines, particularly in terms of logic and global coherence.

Hierarchical Neural Story Generation

This work collects a large dataset of 300K human-written stories paired with writing prompts from an online forum that enables hierarchical story generation, where the model first generates a premise, and then transforms it into a passage of text.

Learning to Tell Tales: A Data-driven Approach to Story Generation

This paper creates an end-to-end system that realizes the various components of the generation pipeline stochastically and follows a generate-and-rank approach, where the space of multiple candidate stories is pruned by considering whether they are plausible, interesting, and coherent.

Quality Signals in Generated Stories

The problem of measuring the quality of automatically generated stories is studied to identify what makes a story continuation interesting, relevant, and of high overall quality.

A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories

A new framework for evaluating story understanding and script learning is proposed: the "Story Cloze Test", which requires a system to choose the correct ending to a four-sentence story, along with a new corpus of 50k five-sentence commonsense stories, ROCStories, to enable this evaluation.

Plan, Write, and Revise: an Interactive System for Open-Domain Story Generation

A neural narrative generation system that interacts with humans to generate stories and finds that humans tasked with collaboratively improving a particular characteristic of a story are in fact able to do so, which has implications for future uses of human-in-the-loop systems.

Plan-And-Write: Towards Better Automatic Storytelling

Experiments show that with explicit storyline planning, the generated stories are more diverse, coherent, and on topic than those generated without creating a full plan, according to both automatic and human evaluations.

Do Massively Pretrained Language Models Make Better Storytellers?

It is found that although GPT2-117 conditions more strongly on context, is more sensitive to ordering of events, and uses more unusual words, it is just as likely to produce repetitive and under-diverse text when using likelihood-maximizing decoding algorithms.

Content Planning for Neural Story Generation with Aristotelian Rescoring

This work presents a system that focuses on learning good plot structures to guide story generation, using a plot-generation language model along with an ensemble of rescoring models, each implementing an aspect of good story-writing as detailed in Aristotle's Poetics.

Towards Controllable Story Generation

A general framework for analyzing existing story corpora to generate controllable and creative new stories is proposed and applied to build recurrent neural network (RNN)-based generation models that control story ending valence and storyline.