Corpus ID: 233296858

Documenting the English Colossal Clean Crawled Corpus

@article{Dodge2021DocumentingTE,
  title={Documenting the English Colossal Clean Crawled Corpus},
  author={Jesse Dodge and Maarten Sap and Ana Marasovi{\'c} and William Agnew and Gabriel Ilharco and Dirk Groeneveld and Matt Gardner},
  journal={ArXiv},
  year={2021},
  volume={abs/2104.08758}
}
As language models are trained on ever more text, researchers are turning to some of the largest corpora available. Unlike most other types of datasets in NLP, large unlabeled text corpora are often presented with minimal documentation, and best practices for documenting them have not been established. In this work we provide the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset created by applying a set of filters to a single snapshot of Common Crawl…
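
For readers unfamiliar with how C4 was assembled, the following Python sketch reimplements a few of the heuristics documented for the dataset (keep lines ending in terminal punctuation, keep lines with at least five words, drop pages with too few sentences, with curly braces, with "lorem ipsum", or containing listed bad words). It is an illustrative approximation under those assumptions, not the pipeline used to build the released corpus; the threshold values and the bad-word set are placeholders.

import re

MIN_WORDS_PER_LINE = 5        # keep lines with at least 5 words
MIN_SENTENCES_PER_PAGE = 3    # drop pages with fewer than 3 sentences
TERMINAL_PUNCT = ('.', '!', '?', '"')

def clean_page(text: str, bad_words: set) -> str | None:
    """Apply a few C4-style heuristics to one web page.

    Returns the cleaned text, or None if the page should be discarded.
    Simplified: the real pipeline also deduplicates three-sentence
    spans and keeps only pages classified as English.
    """
    kept_lines = []
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith(TERMINAL_PUNCT):          # terminal-punctuation filter
            continue
        if len(line.split()) < MIN_WORDS_PER_LINE:     # short-line filter
            continue
        if 'javascript' in line.lower():               # boilerplate filter
            continue
        kept_lines.append(line)

    cleaned = '\n'.join(kept_lines)

    # Page-level filters.
    if len(re.findall(r'[.!?]', cleaned)) < MIN_SENTENCES_PER_PAGE:
        return None
    if '{' in cleaned or 'lorem ipsum' in cleaned.lower():
        return None
    if set(re.findall(r'[a-z]+', cleaned.lower())) & bad_words:
        return None
    return cleaned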

Citations

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

TLDR
This work manually audits the quality of 205 language-specific corpora released with five major public datasets, recommends techniques to evaluate and improve multilingual corpora, and discusses potential risks that come with low-quality data releases.

Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection

TLDR
It is argued that more care is needed to construct training corpora for language models with better transparency and justification for the inclusion or exclusion of various texts, and that privileging any corpus as high quality entails a language ideology.

SynthBio: A Case Study in Human-AI Collaborative Curation of Text Datasets

TLDR
This work introduces a novel method for efficient dataset curation: a large language model is used to provide seed generations to human raters, thereby changing dataset authoring from a writing task to an editing task.

SciFive: a text-to-text transformer model for biomedical literature

TLDR
The SciFive model outperforms the current SOTA methods on tasks in named entity recognition, relation extraction, natural language inference, and question answering, and shows that text-generation methods have significant potential in a broad array of biomedical NLP tasks, particularly those requiring longer, more complex outputs.

Mitigating harm in language models with conditional-likelihood filtration

TLDR
This work presents a methodology for programmatically identifying and removing harmful text from web-scale datasets and discusses the generalization of this method and how trigger phrases reflecting specific values can be used by researchers to build language models which are more closely aligned with their values.
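
As a rough illustration of the conditional-likelihood idea, the sketch below scores each document by the mean log-likelihood a causal language model assigns to a harmful trigger phrase when it is conditioned on (appended to) the document, and drops documents where that likelihood is high. The Hugging Face model choice ("gpt2"), the trigger phrase, and the threshold are placeholders, not the values or implementation from the cited paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders: the scoring model, trigger phrase, and threshold below
# are illustrative, not the choices made in the cited paper.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

TRIGGER = "I hate"      # hypothetical harmful trigger phrase
THRESHOLD = -3.0        # hypothetical mean log-likelihood cutoff

@torch.no_grad()
def trigger_loglik(document: str, trigger: str = TRIGGER) -> float:
    """Mean log-likelihood of the trigger tokens, conditioned on the document."""
    doc_ids = tokenizer(document, return_tensors="pt").input_ids
    trig_ids = tokenizer(trigger, return_tensors="pt").input_ids
    input_ids = torch.cat([doc_ids, trig_ids], dim=1)

    logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)  # predictions for tokens 2..N
    trig_len = trig_ids.shape[1]
    targets = input_ids[:, -trig_len:]                        # the trigger tokens
    trig_log_probs = log_probs[:, -trig_len:, :].gather(-1, targets.unsqueeze(-1))
    return trig_log_probs.mean().item()

def keep_document(document: str) -> bool:
    """Keep documents under which the trigger phrase is sufficiently unlikely."""
    return trigger_loglik(document) < THRESHOLD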

Enhance Text-to-Text Transfer Transformer with Generated Questions for Thai Question Answering

TLDR
This study aims to improve the performance of Thai QA models by generating additional question-answer pairs with the Multilingual Text-to-Text Transfer Transformer (mT5), together with data preprocessing methods for Thai, and shows that the augmented model outperforms other modern transformer models (RoBERTa and mT5) on both datasets.

Addressing "Documentation Debt" in Machine Learning: A Retrospective Datasheet for BookCorpus

TLDR
An initial effort to provide a datasheet for BookCorpus offers a cautionary case study and adds to growing literature that urges more careful, systematic documentation of machine learning datasets.

Multimodal datasets: misogyny, pornography, and malignant stereotypes

TLDR
The recently released LAION-400M dataset, a CLIP-filtered dataset of image-alt-text pairs parsed from Common Crawl, is examined, and it is found to contain troublesome and explicit images and text pairs of rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content.

Time Waits for No One! Analysis and Challenges of Temporal Misalignment

TLDR
It is found that, while temporal adaptation through continued pretraining can help, these gains are small compared to task-specific finetuning on data from the target time period, which motivates continued research to improve temporal robustness of NLP models.

LSH methods for data deduplication in a Wikipedia artificial dataset

TLDR
This paper illustrates locality-sensitive hashing (LSH) methods for the identification and removal of nearly redundant data in a text dataset, using English Wikipedia articles to evaluate the different models.
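
For a concrete sense of how such deduplication works, here is a minimal sketch using the datasketch library's MinHash and MinHashLSH: documents are represented as sets of word shingles, sketched with MinHash, and indexed so that pairs with high estimated Jaccard similarity are reported as near duplicates. The shingle size, number of permutations, and similarity threshold are example values, not the settings evaluated in the cited paper.

from datasketch import MinHash, MinHashLSH

NUM_PERM = 128      # number of hash permutations (example value)
THRESHOLD = 0.8     # approximate Jaccard similarity cutoff (example value)

def shingles(text: str, n: int = 5):
    """Yield word n-grams ("shingles") used as the set representation of a document."""
    words = text.lower().split()
    for i in range(max(len(words) - n + 1, 1)):
        yield " ".join(words[i:i + n])

def minhash(text: str) -> MinHash:
    m = MinHash(num_perm=NUM_PERM)
    for s in shingles(text):
        m.update(s.encode("utf8"))
    return m

def near_duplicates(docs: dict) -> list:
    """Return pairs of document ids whose estimated Jaccard similarity exceeds THRESHOLD."""
    lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
    sketches = {doc_id: minhash(text) for doc_id, text in docs.items()}
    pairs = []
    for doc_id, m in sketches.items():
        for other_id in lsh.query(m):   # only previously inserted docs are returned
            pairs.append((other_id, doc_id))
        lsh.insert(doc_id, m)
    return pairs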

References

SHOWING 1-10 OF 51 REFERENCES

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

TLDR
This work manually audits the quality of 205 language-specific corpora released with five major public datasets, recommends techniques to evaluate and improve multilingual corpora, and discusses potential risks that come with low-quality data releases.

A Massive Collection of Cross-Lingual Web-Document Pairs

TLDR
A new web dataset consisting of 54 million URL pairs from Common Crawl, covering documents in 92 languages paired with English, is released, and the quality of machine translations from models trained on parallel sentence pairs mined from these aligned corpora is evaluated.

CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data

TLDR
An automatic pipeline is presented to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages by following the data processing introduced in fastText, which deduplicates documents and identifies their language.

A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages

TLDR
This work uses the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages and shows that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

TLDR
This work presents the Pile, an 825 GiB English text corpus targeted at training large-scale language models, constructed from 22 diverse, high-quality subsets (both existing and newly constructed), many of which derive from academic or professional sources.

Extracting Training Data from Large Language Models

TLDR
This paper demonstrates that an adversary can perform a training data extraction attack to recover individual training examples by querying the language model, and finds that larger models are more vulnerable than smaller models.

WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia

We present an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 96 languages, including several dialects or low-resource languages.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TLDR
This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

The recent “Text-to-Text Transfer Transformer” (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this paper, mT5 is introduced: a multilingual variant of T5 pre-trained on a new Common Crawl-based dataset covering 101 languages.

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜

TLDR
Recommendations are provided, including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, and carrying out pre-development exercises that evaluate how the planned approach fits into research and development goals and supports stakeholder values.
...