Corpus ID: 246276138

Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection

@article{Gururangan2022WhoseLC,
  title={Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection},
  author={Suchin Gururangan and Dallas Card and Sarah K. Dreier and Emily Kalah Gade and Leroy Z. Wang and Zeyu Wang and Luke Zettlemoyer and Noah A. Smith},
  journal={ArXiv},
  year={2022},
  volume={abs/2201.10474}
}
Language models increasingly rely on massive web dumps for diverse text data. However, these sources are rife with undesirable content. As such, resources like Wikipedia, books, and news often serve as anchors for automatically selecting web text most suitable for language modeling, a process typically referred to as quality filtering. Using a new dataset of U.S. high school newspaper articles—written by students from across the country—we investigate …
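The quality filtering the abstract describes is commonly implemented as a binary classifier trained to distinguish anchor text (e.g., Wikipedia or books) from raw web text, then used to score candidate documents. A minimal sketch of that idea with scikit-learn — the toy data, feature choices, and threshold here are illustrative assumptions, not the paper's actual setup:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy "anchor" (treated as high quality) and raw web examples -- illustrative only.
anchor_docs = [
    "The committee reviewed the proposal and published its findings.",
    "Photosynthesis converts light energy into chemical energy in plants.",
]
web_docs = [
    "CLICK HERE!!! best deals best deals best deals free free free",
    "lol idk u shud just google it tbh",
]

# Binary classifier: label 1 = resembles anchor text, 0 = raw web text.
clf = make_pipeline(
    HashingVectorizer(analyzer="char_wb", ngram_range=(2, 4), n_features=2**16),
    LogisticRegression(),
)
clf.fit(anchor_docs + web_docs, [1, 1, 0, 0])

def passes_filter(doc, threshold=0.5):
    """Keep a document if its predicted 'quality' probability clears the threshold."""
    return clf.predict_proba([doc])[0, 1] >= threshold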

Language Contamination Helps Explain the Cross-lingual Capabilities of English Pretrained Models

English pretrained language models, which make up the backbone of many modern NLP systems, require huge amounts of unlabeled training data. These models are generally presented as being trained only

Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset

One concern with the rise of large language models lies with their potential for significant harm, particularly from pretraining on biased, obscene, copyrighted, and private information. Emerging

Prompting PaLM for Translation: Assessing Strategies and Performance

An in-depth study of the Pathways language model (PaLM), which has demonstrated the strongest machine translation (MT) performance among similarly-trained LLMs to date, and an analysis of PaLM’s MT output which reveals some interesting properties and prospects for future work.

Dataset Debt in Biomedical Language Modeling

A crowdsourced curation of datasheets for 167 biomedical datasets finds that only 13% of datasets are available via programmatic access and 30% lack any documentation on licensing and permitted reuse.

Cultural Re-contextualization of Fairness Research in Language Technologies in India

Recent research has revealed undesirable biases in NLP data and models. However, these efforts largely focus on social disparities in the West, and are not directly portable to other geo-cultural

References

SHOWING 1-10 OF 113 REFERENCES

Documenting the English Colossal Clean Crawled Corpus

This work provides some of the first documentation of the English Colossal Clean Crawled Corpus (C4), one of the largest corpora of text available, and hosts an indexed version of C4 at https://c4-search.allenai.org/, allowing anyone to search it.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

This work presents the Pile, an 825 GiB English text corpus targeted at training large-scale language models, constructed from 22 diverse high-quality subsets—both existing and newly constructed—many of which derive from academic or professional sources.

Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection

This work disentangles what is annotated as toxic by considering posts with three characteristics: anti-Black language, African American English dialect, and vulgarity, and shows strong associations between annotator identity and beliefs and their ratings of toxicity.

Language (Technology) is Power: A Critical Survey of “Bias” in NLP

A greater recognition of the relationships between language and social hierarchies is urged, encouraging researchers and practitioners to articulate their conceptualizations of “bias” and to center work around the lived experiences of members of communities affected by NLP systems.

What to do about bad language on the internet

A critical review of the NLP community's response to the landscape of bad language is offered, along with a quantitative analysis of the lexical diversity of social media text and its relationship to other corpora.

HTLM: Hyper-Text Pre-Training and Prompting of Language Models

It is shown that pretraining with a BART-style denoising loss directly on simplified HTML provides highly effective transfer for a wide range of end tasks and supervision levels, and that HTLM is highly effective at autoprompting itself.

CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data

An automatic pipeline to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages by following the data processing introduced in fastText, which deduplicates documents and identifies their language.
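The deduplication step in pipelines like the one summarized above is often done by hashing normalized paragraphs and discarding repeats across documents. A minimal sketch of that step — the normalization rules below are illustrative assumptions, not CCNet's exact procedure:

```python
import hashlib
import re

def normalize(paragraph: str) -> str:
    # Lowercase and collapse whitespace so trivial variants hash identically.
    return re.sub(r"\s+", " ", paragraph.lower()).strip()

def dedup_paragraphs(docs):
    """Drop any paragraph whose normalized hash was already seen earlier in the stream."""
    seen = set()
    out = []
    for doc in docs:
        kept = []
        for para in doc.split("\n"):
            h = hashlib.sha1(normalize(para).encode("utf-8")).hexdigest()
            if h not in seen:
                seen.add(h)
                kept.append(para)
        out.append("\n".join(kept))
    return out

docs = ["Hello world.\nUnique line A.", "hello   WORLD.\nUnique line B."]
print(dedup_paragraphs(docs))  # the repeated "hello world" paragraph is dropped
```

Hashing keeps memory proportional to the number of distinct paragraphs rather than their total size, which is why paragraph-level hashing scales to Common Crawl-sized corpora.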

Demographic Dialectal Variation in Social Media: A Case Study of African-American English

A case study of dialectal language in online conversational text by investigating African-American English (AAE) on Twitter and proposes a distantly supervised model to identify AAE-like language from demographics associated with geo-located messages, and verifies that this language follows well-known AAE linguistic phenomena.

Language Models are Unsupervised Multitask Learners

It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

Unsupervised Cross-lingual Representation Learning at Scale

It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and the possibility of multilingual modeling without sacrificing per-language performance is shown for the first time.
...