Corpus ID: 232170369

CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review

@article{hendrycks2021cuad,
  title={CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review},
  author={Dan Hendrycks and Collin Burns and Anya Chen and Spencer Ball},
  year={2021}
}
Many specialized domains remain untouched by deep learning, as large labeled datasets require expensive expert annotators. We address this bottleneck within the legal domain by introducing the Contract Understanding Atticus Dataset (CUAD), a new dataset for legal contract review. CUAD was created with dozens of legal experts from The Atticus Project and consists of over 13,000 annotations. The task is to highlight salient portions of a contract that are important for a human to review. We find…
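The task described above, highlighting salient contract portions, can be framed as predicting character spans and comparing them against expert-annotated spans. A minimal sketch of such span matching (not the paper's official evaluation code, which uses precision/recall curves; all names and thresholds here are hypothetical):

```python
# Illustrative sketch: contract review as extractive span highlighting,
# scored by span-level Jaccard overlap against expert annotations.

def jaccard(span_a, span_b):
    """Jaccard overlap between two character spans given as (start, end)."""
    start = max(span_a[0], span_b[0])
    end = min(span_a[1], span_b[1])
    inter = max(0, end - start)
    union = (span_a[1] - span_a[0]) + (span_b[1] - span_b[0]) - inter
    return inter / union if union else 0.0

def match_rate(pred_spans, gold_spans, threshold=0.5):
    """Fraction of gold highlights recovered by at least one prediction."""
    hits = sum(
        any(jaccard(p, g) >= threshold for p in pred_spans)
        for g in gold_spans
    )
    return hits / len(gold_spans) if gold_spans else 1.0

gold = [(100, 180), (420, 500)]   # expert-annotated clause spans
pred = [(95, 175), (600, 650)]    # model-highlighted spans
print(match_rate(pred, gold))     # 0.5: one of two gold spans recovered
```

Here the first prediction overlaps the first gold span heavily (Jaccard ≈ 0.88), while the second gold span is missed entirely, giving a match rate of 0.5.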


Lawformer: A Pre-trained Language Model for Chinese Legal Long Documents
This paper releases Lawformer, a Longformer-based pretrained language model for understanding long Chinese legal documents, and demonstrates that the model achieves promising improvements on tasks with long documents as inputs.
A Statutory Article Retrieval Dataset in French
Statutory article retrieval is the task of automatically retrieving law articles relevant to a legal question. While recent advances in natural language processing have sparked considerable interest…
The Law of Large Documents: Understanding the Structure of Legal Contracts Using Visual Cues
This work finds that visual cues such as layout, style, and placement of text in a document are strong features, crucial to achieving acceptable accuracy on long documents, and that its method of segmenting documents based on structural metadata outperforms existing methods on four long-document understanding tasks.
Representing Long Documents with Contextualized Passage Embeddings (2021)
In this study we investigated a method for processing a large document collection with many long documents. The goal was to improve the processing runtime and memory requirements for document-level…
DEMix Layers: Disentangling Domains for Modular Language Modeling
It is shown that mixing experts during inference, using a parameter-free weighted ensemble, allows the model to better generalize to heterogeneous or unseen domains, and that experts can be added to iteratively incorporate new domains without forgetting older ones and without additional training.


Large-Scale Multi-Label Text Classification on EU Legislation
This work releases a new dataset of 57k legislative documents from EUR-LEX, annotated with ~4.3k EUROVOC labels, suitable for LMTC and few- and zero-shot learning, and shows that BiGRUs with label-wise attention outperform other current state-of-the-art methods.
CJRC: A Reliable Human-Annotated Benchmark DataSet for Chinese Judicial Reading Comprehension
A Chinese judicial reading comprehension dataset containing approximately 10K documents and almost 50K questions with answers is presented, and two strong baseline models based on BERT and BiDAF are built.
How Does NLP Benefit Legal System: A Summary of Legal Artificial Intelligence
The history, current state, and future directions of research in LegalAI are introduced, with an in-depth analysis of the advantages and disadvantages of existing works to explore possible future directions.
A Benchmark for Lease Contract Review
This paper focuses on supporting the review of lease agreements, a contract type that has received little attention in the legal information extraction literature, and defines the types of entities and red flags needed for that task.
Extracting contract elements
This work provides a labeled dataset with gold contract element annotations, along with an unlabeled dataset of contracts that can be used to pre-train word embeddings, and experimentally compares several contract element extraction methods that use manually written rules and linear classifiers with hand-crafted features, word embeddings, and part-of-speech tag embeddings.
COLIEE-2018: Evaluation of the Competition on Legal Information Extraction and Entailment
The evaluation of the 5th Competition on Legal Information Extraction/Entailment 2018 (COLIEE-2018) is summarized, describing each team's approaches, the official evaluation, and an analysis of the data and submission results.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, is presented, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
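The "one additional output layer" idea for extractive question answering can be sketched as a single linear layer mapping each token's final hidden vector to a start logit and an end logit; the predicted answer is the argmax start position followed by the best end position at or after it. A toy, dependency-free illustration (not BERT's actual code; all values below are made up):

```python
# Hedged sketch: a single linear "output layer" over per-token hidden
# states produces start/end logits for extractive span prediction.

def span_from_logits(hidden_states, w_start, w_end):
    """Pick the argmax start token, then the best end token at or after it."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    start_logits = [dot(h, w_start) for h in hidden_states]
    end_logits = [dot(h, w_end) for h in hidden_states]
    start = max(range(len(start_logits)), key=start_logits.__getitem__)
    # constrain the end to come at or after the start
    end = max(range(start, len(end_logits)), key=end_logits.__getitem__)
    return start, end

tokens = ["the", "term", "ends", "on", "june", "30"]
hidden = [[0.1, 0.0], [0.9, 0.2], [0.2, 0.1],
          [0.0, 0.3], [0.8, 0.1], [0.3, 0.9]]
w_start, w_end = [1.0, 0.0], [0.0, 1.0]
start, end = span_from_logits(hidden, w_start, w_end)
print(tokens[start:end + 1])  # ['term', 'ends', 'on', 'june', '30']
```

In real fine-tuning only the two weight vectors (plus the encoder) are trained end-to-end; the span-selection logic at inference is essentially this argmax.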
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
A new benchmark styled after GLUE is presented, comprising a set of more difficult language understanding tasks, a software toolkit, and a public leaderboard.
RoBERTa: A Robustly Optimized BERT Pretraining Approach
It is found that BERT was significantly undertrained and, with better pretraining, can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.