Corpus ID: 246276138

Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection

@article{Gururangan2022WhoseLC,
  title={Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection},
  author={Suchin Gururangan and Dallas Card and Sarah K. Dreier and Emily Kalah Gade and Leroy Z. Wang and Zeyu Wang and Luke Zettlemoyer and Noah A. Smith},
  journal={ArXiv},
  year={2022},
  volume={abs/2201.10474}
}
Language models increasingly rely on massive web dumps for diverse text data. However, these sources are rife with undesirable content. As such, resources like Wikipedia, books, and news often serve as anchors for automatically selecting web text most suitable for language modeling, a process typically referred to as quality filtering. Using a new dataset of U.S. high school newspaper articles—written by students from across the country—we investigate…
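The quality filtering the abstract describes is commonly implemented as a binary classifier: documents from the anchor corpora (Wikipedia, books, news) are treated as positives, raw web text as negatives, and only web documents scored above some threshold are retained. The sketch below illustrates that general recipe only; it is not the authors' code, and the scikit-learn feature and classifier choices, the function names, and the threshold are all placeholder assumptions.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

def train_quality_filter(anchor_docs, web_docs):
    """Fit a binary classifier: anchor text (label 1) vs. raw web text (label 0).

    With real corpora, anchor_docs would be Wikipedia/books/news documents
    and web_docs raw Common Crawl documents.
    """
    vectorizer = HashingVectorizer(ngram_range=(1, 2), n_features=2**20)
    X = vectorizer.transform(anchor_docs + web_docs)
    y = [1] * len(anchor_docs) + [0] * len(web_docs)
    classifier = LogisticRegression(max_iter=1000).fit(X, y)
    return vectorizer, classifier

def filter_web_text(vectorizer, classifier, candidates, threshold=0.9):
    """Keep only candidate documents the classifier scores as 'high quality'."""
    scores = classifier.predict_proba(vectorizer.transform(candidates))[:, 1]
    return [doc for doc, score in zip(candidates, scores) if score >= threshold]

The paper's central question concerns exactly this setup: whichever texts are chosen as positives encode a judgment about whose language counts as high quality, and the threshold then silently filters everyone else out.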
Citations

Language Contamination Explains the Cross-lingual Capabilities of English Pretrained Models
English pretrained language models, which make up the backbone of many modern NLP systems, require huge amounts of unlabeled training data. These models are generally presented as being trained only on English text…
Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
One concern with the rise of large language models lies with their potential for significant harm, particularly from pretraining on biased, obscene, copyrighted, and private information. Emerging…
Dataset Debt in Biomedical Language Modeling
TLDR: A crowdsourced curation of datasheets for 167 biomedical datasets finds that only 13% of datasets are available via programmatic access and 30% lack any documentation on licensing and permitted reuse.

References

Showing 1–10 of 113 references
Documenting the English Colossal Clean Crawled Corpus
TLDR: This work provides some of the first documentation of the English Colossal Clean Crawled Corpus (C4), one of the largest corpora of text available, and hosts an indexed version of C4 at https://c4-search.allenai.org/, allowing anyone to search it.
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
TLDR: This work presents the Pile, an 825 GiB English text corpus targeted at training large-scale language models, constructed from 22 diverse high-quality subsets—both existing and newly constructed—many of which derive from academic or professional sources.
Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection
TLDR: This work disentangles what is annotated as toxic by considering posts with three characteristics—anti-Black language, African American English dialect, and vulgarity—and shows strong associations between annotator identity and beliefs and their ratings of toxicity.
Language (Technology) is Power: A Critical Survey of “Bias” in NLP
TLDR: A greater recognition of the relationships between language and social hierarchies is urged, encouraging researchers and practitioners to articulate their conceptualizations of “bias” and to center work around the lived experiences of members of communities affected by NLP systems.
What to do about bad language on the internet
TLDR: A critical review of the NLP community's response to the landscape of bad language is offered, along with a quantitative analysis of the lexical diversity of social media text and its relationship to other corpora.
HTLM: Hyper-Text Pre-Training and Prompting of Language Models
TLDR: It is shown that pretraining with a BART-style denoising loss directly on simplified HTML provides highly effective transfer for a wide range of end tasks and supervision levels, and that HTLM is highly effective at auto-prompting itself.
CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data
TLDR: An automatic pipeline to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages, following the data processing introduced in fastText, which deduplicates documents and identifies their language.
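Since this summary names two concrete steps, deduplication and fastText language identification, a compact illustration is possible. The sketch below is a simplification under stated assumptions, not CCNet's actual implementation: it assumes the public pretrained lid.176.bin fastText language-ID model has been downloaded locally, and it substitutes a plain SHA-1 paragraph hash for CCNet's deduplication scheme.

import hashlib
import fasttext  # pip install fasttext; assumes lid.176.bin is present locally

lang_id = fasttext.load_model("lid.176.bin")  # pretrained language-ID model

def dedup_paragraphs(docs, seen_hashes=None):
    """Drop any paragraph whose normalized SHA-1 hash was already seen."""
    seen_hashes = set() if seen_hashes is None else seen_hashes
    deduped = []
    for doc in docs:
        kept = []
        for para in doc.split("\n"):
            normalized = para.strip().lower()
            digest = hashlib.sha1(normalized.encode("utf-8")).digest()
            if normalized and digest not in seen_hashes:
                seen_hashes.add(digest)
                kept.append(para)
        deduped.append("\n".join(kept))
    return deduped

def identify_language(doc):
    """Return (language code, confidence) for a single document."""
    # fastText's predict expects a single line, so newlines are flattened.
    labels, probs = lang_id.predict(doc.replace("\n", " "))
    return labels[0].replace("__label__", ""), float(probs[0])

Hashing normalized paragraphs rather than whole documents is the design choice that lets near-duplicate pages (boilerplate headers, mirrored articles) be pruned even when the full documents differ.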
Diffusion of Lexical Change in Social Media
TLDR: Using a latent vector autoregressive model to aggregate across thousands of words, high-level patterns in the diffusion of linguistic change across the United States are identified, and support is offered for prior arguments that focus on geographical proximity and population size.
Demographic Dialectal Variation in Social Media: A Case Study of African-American English
TLDR: This work presents a case study of dialectal language in online conversational text, investigating African-American English (AAE) on Twitter; it proposes a distantly supervised model to identify AAE-like language from demographics associated with geo-located messages and verifies that this language follows well-known AAE linguistic phenomena.
Language Models are Unsupervised Multitask Learners
TLDR: It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
…