A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages

@inproceedings{Surez2020AMA,
  title={A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages},
  author={Pedro Javier Ortiz Su{\'a}rez and L. Romary and Beno{\^i}t Sagot},
  booktitle={ACL},
  year={2020}
}
We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia.
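The "language classification, filtering and cleaning" step the abstract mentions can be illustrated with a minimal sketch that keeps only lines confidently identified as a target language, here using fastText's pretrained language-identification model. This is not the authors' actual pipeline (OSCAR is built with the goclassy tool), and the model path, length cutoff, confidence threshold, and language code below are illustrative assumptions.

# Hedged sketch of a language-ID filter over Common Crawl-style text lines.
# Assumes the pretrained fastText model "lid.176.bin" has been downloaded;
# the real OSCAR (goclassy) pipeline differs in tooling and thresholds.
import fasttext

model = fasttext.load_model("lid.176.bin")  # pretrained language identifier

def keep_lines(lines, lang="ca", min_conf=0.8, min_chars=100):
    """Keep lines classified as `lang` with enough confidence and length."""
    kept = []
    for line in lines:
        text = line.strip().replace("\n", " ")
        if len(text) < min_chars:          # drop very short / boilerplate lines
            continue
        labels, probs = model.predict(text)
        if labels[0] == f"__label__{lang}" and probs[0] >= min_conf:
            kept.append(text)
    return kept

Applied to a Common Crawl dump, a filter of this kind retains only text confidently identified as the target language, which is the spirit of how OSCAR's monolingual sub-corpora are carved out of the multilingual crawl before the ELMo models are trained.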

Citations

Distributed Deep Learning in Open Collaborations
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
AraGPT2: Pre-Trained Transformer for Arabic Language Generation
Bertinho: Galician BERT Representations
Documenting the English Colossal Clean Crawled Corpus
