• Computer Science
  • Published 2019

Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures

@inproceedings{Surez2019AsynchronousPF,
  title={Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures},
  author={Pedro Javier Ortiz Su{\'a}rez and Beno{\^i}t Sagot and Laurent Romary},
  year={2019}
}
Common Crawl is a considerably large, heterogeneous multilingual corpus composed of documents crawled from the internet, surpassing 20TB of data and distributed as a set of more than 50 thousand plain text files, each of which contains many documents written in a wide variety of languages. Even though each document has a metadata block associated with it, this data lacks any information about the language in which each document is written, making it extremely difficult to use Common Crawl for…
