Corpus ID: 225094586

Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus

  • Isaac Caswell, Theresa Breiner, D. V. Esch, Ankur Bapna
  • Published 2020
  • Computer Science
  • ArXiv
  • Large text corpora are increasingly important for a wide variety of Natural Language Processing (NLP) tasks, and automatic language identification (LangID) is a core technology needed to collect such datasets in a multilingual context. LangID is largely treated as solved in the literature, with models reported that achieve over 90% average F1 on as many as 1,366 languages. We train LangID models on up to 1,629 languages with comparable quality on held-out test sets, but find that human-judged…
