CamemBERT: a Tasty French Language Model

@article{Martin2020CamemBERTAT,
  title={CamemBERT: a Tasty French Language Model},
  author={Louis Martin and Benjamin Muller and Pedro Ortiz Suarez and Yoann Dupont and Laurent Romary and Eric Villemonte de la Clergerie and Djam{\'e} Seddah and Beno{\^i}t Sagot},
  journal={ArXiv},
  year={2020},
  volume={abs/1911.03894}
}
Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models (in all languages except English) very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on…
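
For readers who want to try the released model, a minimal usage sketch follows. It is not taken from the paper: it assumes the publicly released camembert-base checkpoint and the Hugging Face transformers fill-mask pipeline, and the French prompt is only an illustrative example.

    # Minimal sketch (not from the paper): querying a pretrained French masked
    # language model through the Hugging Face `transformers` library.
    # Assumes the publicly released "camembert-base" checkpoint is available.
    from transformers import pipeline

    # CamemBERT follows RoBERTa and uses "<mask>" as its mask token.
    fill_mask = pipeline("fill-mask", model="camembert-base")

    # Print the top completions for an illustrative French prompt.
    for prediction in fill_mask("Le camembert est <mask> :)"):
        print(f"{prediction['token_str']!r}  score={prediction['score']:.3f}")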

FlauBERT: Unsupervised Language Model Pre-training for French
TLDR
This paper introduces and shares FlauBERT, a model learned on a very large and heterogeneous French corpus, applies it to diverse NLP tasks, and shows that it outperforms other pre-training approaches most of the time.
Pre-training Polish Transformer-based Language Models at Scale
TLDR
This study presents two language models for Polish based on the popular BERT architecture, one of which was trained on a dataset consisting of over 1 billion Polish sentences (135 GB of raw text), and describes the methodology for collecting the data, preparing the corpus, and pre-training the model.
BanglaBERT: Combating Embedding Barrier for Low-Resource Language Understanding
TLDR
This work builds a Bangla natural language understanding model pre-trained on 18.6 GB of data crawled from top Bangla sites on the internet, and identifies a major shortcoming of multilingual models, named the ‘Embedding Barrier’, that hurts performance for low-resource languages that do not share a writing script with any high-resource language.
Indic-Transformers: An Analysis of Transformer Language Models for Indian Languages
TLDR
This work compares the efficacy of fine-tuning the parameters of pre-trained models against that of training a language model from scratch, and empirically argues against a strict dependency between dataset size and model performance, encouraging task-specific model and method selection instead.
GottBERT: a pure German Language Model
TLDR
GottBERT is a pre-trained German language model based on the original RoBERTa architecture; it outperformed all other tested German and multilingual models on Named Entity Recognition (NER) and on the GermEval 2018 and GNAD text classification tasks.
When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models
TLDR
It is shown that transliterating unseen languages significantly improves the potential of large-scale multilingual language models on downstream tasks and provides a promising direction towards making these massively multilingual models useful for a new set of unseen languages.
WikiBERT Models: Deep Transfer Learning for Many Languages
TLDR
A simple, fully automated pipeline for creating language-specific BERT models from Wikipedia data is introduced, along with 42 new such models, most of them for languages that until now lacked a dedicated deep neural language model.
GREEK-BERT: The Greeks visiting Sesame Street
TLDR
This paper presents GREEK-BERT, a monolingual BERT-based language model for modern Greek, and evaluates its performance in three NLP tasks, i.e., part-of-speech tagging, named entity recognition, and natural language inference, obtaining state-of-the-art performance.
FQuAD: French Question Answering Dataset
TLDR
The present work introduces the French Question Answering Dataset (FQuAD), a French Native Reading Comprehension dataset of questions and answers on a set of Wikipedia articles that consists of 25,000+ samples for the 1.0 and 1.1 versions.
BERTimbau: Pretrained BERT Models for Brazilian Portuguese
TLDR
This work trains BERT (Bidirectional Encoder Representations from Transformers) models for Brazilian Portuguese, nicknamed BERTimbau, and evaluates them on three downstream NLP tasks: sentence textual similarity, recognizing textual entailment, and named entity recognition.
…

References

SHOWING 1-10 OF 83 REFERENCES
FlauBERT: Unsupervised Language Model Pre-training for French
TLDR
This paper introduces and shares FlauBERT, a model learned on a very large and heterogeneous French corpus, applies it to diverse NLP tasks, and shows that it outperforms other pre-training approaches most of the time.
XNLI: Evaluating Cross-lingual Sentence Representations
TLDR
This work constructs an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus to 15 languages, including low-resource languages such as Swahili and Urdu, and finds that XNLI represents a practical and challenging evaluation suite and that directly translating the test data yields the best performance among available baselines.
FQuAD: French Question Answering Dataset
TLDR
The present work introduces the French Question Answering Dataset (FQuAD), a French Native Reading Comprehension dataset of questions and answers on a set of Wikipedia articles that consists of 25,000+ samples for the 1.0 and 1.1 versions.
Multilingual is not enough: BERT for Finnish
TLDR
While the multilingual model largely fails to reach the performance of previously proposed methods, the custom Finnish BERT model establishes new state-of-the-art results on all corpora for all reference tasks: part-of-speech tagging, named entity recognition, and dependency parsing.
75 Languages, 1 Model: Parsing Universal Dependencies Universally
TLDR
It is found that fine-tuning a multilingual BERT self-attention model pretrained on 104 languages can meet or exceed state-of-the-art UPOS, UFeats, Lemmas, and (especially) UAS and LAS scores, without requiring any recurrent or language-specific components.
What the [MASK]? Making Sense of Language-Specific BERT Models
TLDR
The current state of the art in language-specific BERT models is presented, providing an overall picture with respect to different dimensions (i.e., architectures, data domains, and tasks) and an immediate and straightforward overview of their commonalities and differences.
Cross-lingual Language Model Pretraining
TLDR
This work proposes two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective.
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Learning Word Vectors for 157 Languages
TLDR
This paper describes how high-quality word representations for 157 languages were trained on the free online encyclopedia Wikipedia and data from the Common Crawl project, and introduces three new word analogy datasets to evaluate these word vectors.
Deep Contextualized Word Representations
TLDR
A new type of deep contextualized word representation is introduced that models both complex characteristics of word use and how these uses vary across linguistic contexts, allowing downstream models to mix different types of semi-supervision signals.
…