Corpus ID: 247011456

From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French

Simon Gabay, Pedro Ortiz Suarez, Alexandre Bartz, Alix Chagué, Rachel Bawden, Philippe Gambette, Benoît Sagot
Language models for historical states of language are becoming increasingly important to allow the optimal digitisation and analysis of old textual sources. Because these historical states are both more complex to process and scarcer in the available corpora, specific efforts are necessary to train natural language processing (NLP) tools adapted to the data. In this paper, we present our efforts to develop NLP tools for Early Modern French (historical French from the 16th to…
1 Citation

BERToldo, the Historical BERT for Italian
BERToldo, the Italian version of historical BERT, is introduced; it is shown that deduplication reduces training time without affecting performance, and that duplicated data is rather common for languages with a limited availability of historical corpora.

References

CamemBERT: a Tasty French Language Model
This paper investigates the feasibility of training monolingual Transformer-based language models for languages other than English, taking French as an example, and evaluates the resulting models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks.
Cross-lingual Language Model Pretraining
This work proposes two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective.
A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages
This work uses the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages and shows that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.
Language Models are Unsupervised Multitask Learners
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre
The use of a recent lemmatiser based on neural networks and a CRF tagger achieves accuracies beyond the current state of the art on the in-domain test, and proves robust in out-of-domain tests, i.e. on texts up to 20th-century novels.
Improving Lemmatization of Non-Standard Languages with Joint Learning
This paper approaches lemmatization as a string-transduction task with an encoder-decoder architecture that is enriched with sentence information using a hierarchical sentence encoder, and shows significant improvements over the state of the art by fine-tuning the sentence encodings to jointly optimise a bidirectional language model loss.
FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP
The core idea of the FLAIR framework is to present a simple, unified interface for conceptually very different types of word and document embeddings, which effectively hides all embedding-specific engineering complexity and allows researchers to "mix and match" various embeddings with little effort.
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜
Recommendations are provided, including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, and carrying out pre-development exercises that evaluate how the planned approach fits research and development goals and supports stakeholder values.
FLERT: Document-Level Features for Named Entity Recognition
An evaluation on the classic CoNLL benchmark datasets for four languages shows that document-level features significantly improve NER quality and that fine-tuning generally outperforms feature-based approaches.
Latin BERT: A Contextual Language Model for Classical Philology
It is shown that Latin BERT achieves a new state of the art for part-of-speech tagging on all three Universal Dependency datasets for Latin and can be used for predicting missing text (including critical emendations) and for semantically-informed search by querying contextual nearest neighbors.