Corpus ID: 230435736

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

@article{Gao2021ThePA,
  title={The Pile: An 800GB Dataset of Diverse Text for Language Modeling},
  author={Leo Gao and Stella Rose Biderman and Sid Black and Laurence Golding and Travis Hoppe and Charles Foster and Jason Phang and Horace He and Anish Thite and Noa Nabeshima and Shawn Presser and Connor Leahy},
  journal={ArXiv},
  year={2021},
  volume={abs/2101.00027}
}
Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets, both existing and newly constructed, many of which derive from academic or professional sources. Our evaluation of the untuned…
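To make "constructed from 22 diverse subsets" concrete, here is a minimal sketch of sampling training documents from a weighted mixture of subset files. The file names and weights below are illustrative placeholders, not the Pile's actual components or epoch weights.

```python
import json
import random

# Hypothetical subset files and mixture weights; the real Pile combines 22
# components, each with its own sampling weight over the training run.
SUBSETS = {
    "web.jsonl": 0.40,
    "academic.jsonl": 0.35,
    "code.jsonl": 0.15,
    "dialogue.jsonl": 0.10,
}

def iter_jsonl(path):
    """Yield the 'text' field of each JSON line in a subset file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)["text"]

def sample_mixture(subsets, n_docs, seed=0):
    """Draw n_docs documents, choosing a subset for each draw by its weight."""
    rng = random.Random(seed)
    names = list(subsets)
    weights = [subsets[n] for n in names]
    iters = {n: iter_jsonl(n) for n in names}
    for _ in range(n_docs):
        name = rng.choices(names, weights=weights, k=1)[0]
        try:
            yield name, next(iters[name])
        except StopIteration:
            iters[name] = iter_jsonl(name)  # start another pass ("epoch")
            yield name, next(iters[name])

if __name__ == "__main__":
    for source, text in sample_mixture(SUBSETS, n_docs=5):
        print(source, text[:60].replace("\n", " "))
```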
mGPT: Few-Shot Learners Go Multilingual
TLDR
This paper introduces two autoregressive GPT-like models with 1.3 billion and 13 billion parameters trained on 60 languages from 25 language families using Wikipedia and the Colossal Clean Crawled Corpus, and trains small versions of the model to choose the optimal multilingual tokenization strategy.
"ANNA": Enhanced Language Representation for Question Answering
TLDR
This paper proposes an extended pre-training task and a new neighbor-aware mechanism that attends more to neighboring tokens to better capture the richness of context during language-model pre-training.
Improving language models by retrieving from trillions of tokens
TLDR
Transformers have been scaled from 100-million-parameter models in seminal work to over a hundred billion parameters in the last two years, leading to models that perform very well on a wide array of tasks in a zero-shot or few-shot formulation.
Language Models are Few-shot Multilingual Learners
TLDR
It is shown that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones, and they are competitive with existing state-of-the-art cross-lingual models and translation models.
A Large and Diverse Arabic Corpus for Language Modeling
TLDR
This work elaborates on the design and development of a large Arabic corpus consisting of over 500 GB of cleaned Arabic text, targeted at improving the cross-domain knowledge and downstream generalization capability of large-scale language models.
Documenting the English Colossal Clean Crawled Corpus
TLDR
This work provides some of the first documentation of the English Colossal Clean Crawled Corpus (C4), one of the largest corpora of text available, and hosts an indexed version of C4 at https://c4-search.allenai.org/, allowing anyone to search it.
FPM: A Collection of Large-scale Foundation Pre-trained Language Models
TLDR
Currently effective model structures and the most mainstream technology are used to launch a model set, which the authors expect to serve as basic models in the future.
ORCA: Interpreting Prompted Language Models via Locating Supporting Data Evidence in the Ocean of Pretraining Data
TLDR
Supporting data evidence for the model's task-specific competence is sought in the pretraining data, and a novel method, ORCA, is proposed to identify it effectively by iteratively using gradient information related to the downstream task.
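The TLDR only gestures at the mechanism, so here is a deliberately simplified sketch of the underlying idea: rank pretraining examples by how well their loss gradients align with the downstream task's gradient. Everything below (the toy logistic-regression model, the random data, the cosine-similarity ranking) is an illustrative assumption, not ORCA's actual iterative procedure on prompted language models.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_logistic(w, x, y):
    """Gradient of the logistic loss for a single (x, y) example."""
    p = 1.0 / (1.0 + np.exp(-x @ w))
    return (p - y) * x

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

dim = 16
w = rng.normal(size=dim)                                  # "pretrained" weights
pretrain = [(rng.normal(size=dim), rng.integers(0, 2)) for _ in range(200)]
task = [(rng.normal(size=dim), rng.integers(0, 2)) for _ in range(20)]

# Aggregate gradient of the downstream task loss.
task_grad = np.mean([grad_logistic(w, x, y) for x, y in task], axis=0)

# Rank pretraining examples by gradient alignment with the task gradient.
scores = [cosine(grad_logistic(w, x, y), task_grad) for x, y in pretrain]
top = np.argsort(scores)[::-1][:5]
print("indices of most supportive pretraining examples:", top.tolist())
```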
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
TLDR
This work provides some of the first documentation for the Colossal Clean Crawled Corpus (C4), a dataset created by applying a set of filters to a single snapshot of Common Crawl, and evaluates the text that was removed, and shows that blocklist filtering disproportionately removes text from and about minority individuals.
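For context on the kind of filtering being audited here, below is a minimal sketch of document-level blocklist filtering: drop any document that contains a listed word. The blocklist contents are placeholders, and C4's actual pipeline applies several additional heuristics beyond this single filter.

```python
import re

# Placeholder blocklist; C4 drops any document containing a word from a
# public "bad words" list, which is the filter examined in the paper.
BLOCKLIST = frozenset({"badword1", "badword2"})

def passes_blocklist(document: str, blocklist=BLOCKLIST) -> bool:
    """Return True if no blocklisted word appears in the document."""
    tokens = set(re.findall(r"[a-z']+", document.lower()))
    return tokens.isdisjoint(blocklist)

docs = ["an innocuous sentence", "a sentence containing badword1"]
print([d for d in docs if passes_blocklist(d)])  # keeps only the first doc
```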
Deduplicating Training Data Makes Language Models Better
TLDR
Two tools are developed that allow us to deduplicate training datasets and train models that emit memorized text ten times less frequently and require fewer training steps to achieve the same or better accuracy.
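The tools described in the paper operate on exact substrings and near-duplicates; the sketch below shows only the simplest variant of the idea, dropping exact duplicates of whitespace-normalized documents, to make "deduplicating a training set" concrete.

```python
import hashlib

def normalized_hash(text: str) -> str:
    """Hash a whitespace-normalized, lowercased document."""
    canonical = " ".join(text.lower().split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def deduplicate(documents):
    """Keep only the first occurrence of each distinct document."""
    seen = set()
    for doc in documents:
        h = normalized_hash(doc)
        if h not in seen:
            seen.add(h)
            yield doc

docs = ["The same text.", "the  same text.", "Different text."]
print(list(deduplicate(docs)))  # ['The same text.', 'Different text.']
```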

Table: Test perplexity of the Pile using GPT-2 and GPT-3. Evaluation is performed on one-tenth of the Pile's test data.
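As a reminder of what the table reports, perplexity is the exponentiated average negative log-likelihood per token. The snippet below is a minimal sketch using made-up per-token log-probabilities rather than actual GPT-2 or GPT-3 outputs.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probabilities from a language model.
logprobs = [-2.1, -0.4, -3.3, -1.0]
print(round(perplexity(logprobs), 2))  # ~5.47
```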