Corpus ID: 195505104

Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures

@inproceedings{Surez2019AsynchronousPF,
  title={Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures},
  author={Pedro Javier Ortiz Su{\'a}rez and Beno{\^i}t Sagot and L. Romary},
  year={2019}
}
  • Pedro Javier Ortiz Suárez, Benoît Sagot, L. Romary
  • Published 2019
  • Computer Science
  • Common Crawl is a very large, heterogeneous multilingual corpus composed of documents crawled from the internet. It surpasses 20TB of data and is distributed as a set of more than 50 thousand plain text files, each containing many documents written in a wide variety of languages. Even though each document has an associated metadata block, this metadata lacks any information about the language in which the document is written, making it extremely difficult to use Common Crawl for…