Corpus ID: 245836939

SCROLLS: Standardized CompaRison Over Long Language Sequences

@article{Shaham2022SCROLLSSC,
  title={{SCROLLS}: Standardized CompaRison Over Long Language Sequences},
  author={Uri Shaham and Elad Segal and Maor Ivgi and Avia Efrat and Ori Yoran and Adi Haviv and Ankit Gupta and Wenhan Xiong and Mor Geva and Jonathan Berant and Omer Levy},
  journal={arXiv preprint arXiv:2201.03533},
  year={2022}
}
NLP benchmarks have largely focused on short texts, such as sentences and paragraphs, even though long texts comprise a considerable amount of natural language in the wild. We introduce SCROLLS, a suite of tasks that require reasoning over long texts. We examine existing long-text datasets, and handpick ones where the text is naturally long, while prioritizing tasks that involve synthesizing information across the input. SCROLLS contains summarization, question answering, and natural language… 

Citations

Efficient Long-Text Understanding with Short-Text Models

TLDR
This work proposes SLED, a simple approach for processing long sequences that reuses battle-tested short-text pretrained LMs, and shows that SLED is competitive with specialized models that are up to 50x larger and require a dedicated, expensive pretraining step.

MuLD: The Multitask Long Document Benchmark

TLDR
MuLD is presented: a new long document benchmark consisting only of documents over 10,000 tokens, which requires models to successfully model long-term dependencies in the text and shows that models with increased context length are better able to solve the tasks presented.

QuALITY: Question Answering with Long Input Texts, Yes!

TLDR
QuALITY is introduced: a multiple-choice QA dataset with English context passages averaging about 5,000 tokens, much longer than typical current models can process, enabling the building and testing of models for long-document comprehension.

ChapterBreak: A Challenge Dataset for Long-Range Language Models

TLDR
This work introduces ChapterBreak, a challenge dataset that provides a long-range language model (LRLM) with a long segment from a narrative ending at a chapter boundary and asks it to distinguish the beginning of the ground-truth next chapter from a set of negative segments sampled from the same narrative.

Investigating Efficiently Extending Transformers for Long Input Summarization

TLDR
PEGASUS-X is introduced, an extension of the PEGASUS model with additional long-input pretraining to handle inputs of up to 16K tokens; it achieves strong performance on long-input summarization tasks comparable with much larger models, while adding few additional parameters and not requiring model parallelism to train.

Characterizing the Efficiency vs. Accuracy Trade-off for Long-Context NLP Models

TLDR
A systematic study of the accuracy vs. efficiency trade-off on two widely used long-sequence models, Longformer-Encoder-Decoder (LED) and Big Bird, during fine-tuning and inference on four datasets from the SCROLLS benchmark finds that, for higher accuracy, increasing model size is more energy-efficient than increasing sequence length.

Conditional Generation with a Question-Answering Blueprint

TLDR
This work proposes a new conceptualization of text plans as a sequence of question-answer (QA) pairs, enhancing existing datasets with a QA blueprint operating as a proxy for both content selection and planning.

The NLP Task Effectiveness of Long-Range Transformers

TLDR
It is found that the attention of long-range transformers has advantages in content selection and query-guided decoding, but comes with previously unrecognized drawbacks such as insufficient attention to distant tokens.

TRUE: Re-evaluating Factual Consistency Evaluation

TLDR
TRUE is introduced: a comprehensive survey and assessment of factual consistency metrics on a standardized collection of existing texts from diverse tasks, manually annotated for factual consistency, and it is found that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.

References

SHOWING 1-10 OF 37 REFERENCES

Do Long-Range Language Models Actually Use Long-Range Context?

TLDR
This paper performs a fine-grained analysis of two long-range Transformer language models (including the Routing Transformer, which achieves state-of-the-art perplexity on the PG-19 long-sequence LM benchmark dataset) that accept input sequences of up to 8K tokens, and discovers that long-range context helps most for literary novels.

Shortformer: Better Language Modeling using Shorter Inputs

TLDR
This work identifies conditions where shorter inputs are not harmful, achieves perplexity and efficiency improvements through two new methods that decrease input length, and shows how to improve the efficiency of recurrence methods in transformers.

SQuAD: 100,000+ Questions for Machine Comprehension of Text

TLDR
SQuAD, a dataset of 100,000+ questions for machine comprehension of text, is presented, and a strong logistic regression model achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%) but still far below human performance.

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

TLDR
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.

The NarrativeQA Reading Comprehension Challenge

TLDR
A new dataset and set of tasks in which the reader must answer questions about stories by reading entire books or movie scripts are presented, designed so that successfully answering their questions requires understanding the underlying narrative rather than relying on shallow pattern matching or salience.

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

TLDR
A new benchmark styled after GLUE is presented, with a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard.

Longformer: The Long-Document Transformer

TLDR
Following prior work on long-sequence transformers, Longformer is evaluated on character-level language modeling, where it achieves state-of-the-art results on text8 and enwik8; Longformer is also pretrained and finetuned on a variety of downstream tasks.

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

TLDR
BART is presented, a denoising autoencoder for pretraining sequence-to-sequence models, which matches the performance of RoBERTa on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks.

SummScreen: A Dataset for Abstractive Screenplay Summarization

TLDR
Human evaluation and qualitative analysis reveal that the non-oracle models are competitive with their oracle counterparts in terms of generating faithful plot events and can benefit from better content selectors.

Synthesizer: Rethinking Self-Attention in Transformer Models

TLDR
This work investigates the true importance and contribution of the dot-product-based self-attention mechanism to the performance of Transformer models, and proposes Synthesizer, a model that learns synthetic attention weights without token-token interactions.