A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages

@inproceedings{OrtizSurez2020AMA,
  title={A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages},
  author={Pedro Javier Ortiz Su{\'a}rez and Laurent Romary and Beno{\^i}t Sagot},
  booktitle={ACL},
  year={2020}
}
We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia.
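To make the language-classification step mentioned in the abstract concrete, here is a minimal sketch of line-level filtering with fastText's off-the-shelf language identifier. It is an illustration only, not the pipeline actually used to build OSCAR; the model path, file names, length cutoff and confidence threshold are assumptions.

```python
# Minimal sketch: keep Common Crawl lines that are long enough and are
# confidently identified as the target language by fastText's pre-trained
# language identifier (lid.176.bin). Paths, the length cutoff and the
# confidence threshold are illustrative assumptions, not OSCAR's exact settings.
import fasttext

model = fasttext.load_model("lid.176.bin")  # pre-trained language-ID model

def keep_line(line, target_lang="fi", min_chars=100, min_confidence=0.8):
    """Return True if the line should be kept for the target language."""
    line = line.strip()
    if len(line) < min_chars:
        return False
    labels, probs = model.predict(line, k=1)   # e.g. (('__label__fi',), array([0.99]))
    lang = labels[0].replace("__label__", "")
    return lang == target_lang and probs[0] >= min_confidence

with open("commoncrawl_shard.txt", encoding="utf-8") as src, \
     open("oscar_fi_filtered.txt", "w", encoding="utf-8") as dst:
    for raw in src:
        if keep_line(raw):
            dst.write(raw)
```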

Citations

Distributed Deep Learning in Open Collaborations
This work analyzes the constraints of collaborative training, proposes a novel algorithmic framework designed specifically for it, and demonstrates SwAV and ALBERT pretraining in realistic conditions with performance comparable to traditional setups at a fraction of the cost.
Ungoliant: An Optimized Pipeline for the Generation of a Very Large-Scale Multilingual Web Corpus
Since the introduction of large language models in Natural Language Processing, large raw corpora have played a crucial role in Computational Linguistics. However, most of these large raw corpora are…
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
This work manually audits the quality of 205 language-specific corpora released with five major public datasets, audits the correctness of language codes in a sixth, and finds that lower-resource corpora have systematic issues.
Adapting Monolingual Models: Data can be Scarce when Language Similarity is High
This work retrains the lexical layers of four BERT-based models using data from two low-resource target language varieties, while the Transformer layers are independently fine-tuned on a POS-tagging task in the model's source language.
AlephBERT: A Hebrew Large Pre-Trained Language Model to Start-off your Hebrew NLP Application With
AlephBERT is presented, a large pre-trained language model for Modern Hebrew trained on a larger vocabulary and a larger dataset than any Hebrew PLM before; it is made publicly available, providing a single point of entry for the development of Hebrew NLP applications.
AlephBERT: Pre-training and End-to-End Language Model Evaluation from Sub-Word to Sentence Level
  • 2021
Large Pre-trained Language Models (PLMs) have become ubiquitous in the development of language understanding technology and lie at the heart of many artificial intelligence advances. While advances…
AraELECTRA: Pre-Training Text Discriminators for Arabic Language Understanding
The model is pretrained using the replaced-token-detection objective on large Arabic text corpora, and AraELECTRA is shown to outperform current state-of-the-art Arabic language representation models given the same pretraining data and an even smaller model size.
AraGPT2: Pre-Trained Transformer for Arabic Language Generation
AraGPT2, the first advanced Arabic language generation model, is trained from scratch on a large Arabic corpus of internet text and news articles; an automatic discriminator model with 98% accuracy in detecting model-generated text is also released.
BERT, mBERT, or BiBERT? A Study on Contextualized Embeddings for Neural Machine Translation
The success of bidirectional encoders using masked language models, such as BERT, on numerous natural language processing tasks has prompted researchers to attempt to incorporate these pre-trained…
Benchmarking Transformer-based Language Models for Arabic Sentiment and Sarcasm Detection
The introduction of transformer-based language models has been a revolutionary step for natural language processing (NLP) research. These models, such as BERT, GPT and ELECTRA, led to…

References

Showing 1–10 of 52 references
Energy and Policy Considerations for Deep Learning in NLP
This paper quantifies the approximate financial and environmental costs of training a variety of recently successful neural network models for NLP and proposes actionable recommendations to reduce costs and improve equity in NLP research and practice.
Evaluating Contextualized Embeddings on 54 Languages in POS Tagging, Lemmatization and Dependency Parsing
We present an extensive evaluation of three recently proposed methods for contextualized embeddings on 89 corpora in 54 languages of the Universal Dependencies 2.3 in three tasks: POS tagging, lemmatization and dependency parsing.
CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
This overview paper defines the task and the updated evaluation methodology, describes data preparation, reports and analyzes the main results, and provides a brief categorization of the different approaches of the participating systems.
Deep Contextualized Word Representations
A new type of deep contextualized word representation is introduced that models both complex characteristics of word use and how these uses vary across linguistic contexts, allowing downstream models to mix different types of semi-supervision signals.
Learning Word Vectors for 157 Languages
This paper describes how high-quality word representations for 157 languages were trained on the free online encyclopedia Wikipedia and data from the Common Crawl project, and introduces three new word analogy datasets to evaluate these word vectors.
Multilingual BERT
  • https://github.com/google-research/bert/blob/master/multilingual.md
  • 2018
Language Models are Unsupervised Multitask Learners
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Towards Better UD Parsing: Deep Contextualized Word Embeddings, Ensemble, and Treebank Concatenation
This paper describes the HIT-SCIR system submitted to the CoNLL 2018 shared task on Multilingual Parsing from Raw Text to Universal Dependencies, which was ranked first according to LAS and outperformed the other systems by a large margin.
UDPipe 2.0 Prototype at CoNLL 2018 UD Shared Task
A prototype for UDPipe 2.0 is presented and evaluated in the CoNLL 2018 UD Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies; the system also ranked first in the overall score of the extrinsic parser evaluation EPE 2018.
Enriching Word Vectors with Subword Information
A new approach based on the skip-gram model is proposed, in which each word is represented as a bag of character n-grams and a word vector is the sum of its n-gram representations; it achieves state-of-the-art performance on word similarity and analogy tasks.
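To make the subword idea described in this reference concrete, here is a minimal sketch of composing a word vector as the sum of its character n-gram vectors. It is a toy illustration, not the fastText implementation; the n-gram range and the random vector table are assumptions.

```python
# Minimal sketch of the subword idea: a word vector is the sum of the vectors
# of its character n-grams (plus the full word itself). The n-gram range and
# the toy vector table are illustrative assumptions.
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, with < and > as boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    grams.append(marked)  # the full word is treated as one more unit
    return grams

def word_vector(word, ngram_vectors, dim=300):
    """Sum the vectors of all n-grams of the word that appear in the lookup table."""
    vec = np.zeros(dim)
    for gram in char_ngrams(word):
        if gram in ngram_vectors:
            vec += ngram_vectors[gram]
    return vec

# Toy usage: random vectors stand in for trained skip-gram parameters.
rng = np.random.default_rng(0)
table = {g: rng.standard_normal(300) for g in char_ngrams("where")}
print(char_ngrams("where")[:5])           # ['<wh', 'whe', 'her', 'ere', 're>']
print(word_vector("where", table).shape)  # (300,)
```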