Corpus ID: 247011456

From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French

@article{Gabay2022FromFT,
  title={From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French},
  author={Simon Gabay and Pedro Ortiz Su{\'a}rez and Alexandre Bartz and Alix Chagu{\'e} and Rachel Bawden and Philippe Gambette and Beno{\^i}t Sagot},
  journal={ArXiv},
  year={2022},
  volume={abs/2202.09452}
}
Language models for historical states of language are becoming increasingly important to allow the optimal digitisation and analysis of old textual sources. Because these historical states are at the same time more complex to process and scarcer in the available corpora, specific efforts are necessary to train natural language processing (NLP) tools adapted to the data. In this paper, we present our efforts to develop NLP tools for Early Modern French (historical French from the 16th to…
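As a rough illustration of how a model of this kind can be queried once released, the sketch below uses the Hugging Face transformers fill-mask pipeline; the checkpoint identifier and the example sentence are placeholders rather than details taken from the paper.

from transformers import pipeline

# Placeholder identifier: substitute the actual D'AlemBERT checkpoint name once known.
MODEL_ID = "path/to/dalembert-checkpoint"

# RoBERTa-style tokenisers use "<mask>" as the mask token.
fill_mask = pipeline("fill-mask", model=MODEL_ID)
for prediction in fill_mask("Il estoit une fois un <mask> fort sage."):
    print(prediction["token_str"], round(prediction["score"], 3))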


References

Showing 1–10 of 46 references
CamemBERT: a Tasty French Language Model
TLDR
This paper investigates the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating the resulting models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks.
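For context, the following sketch sets up such a monolingual model for one of the listed tasks (part-of-speech tagging) with the transformers library; it assumes the publicly released camembert-base checkpoint is reachable on the Hugging Face hub and uses the 17 Universal Dependencies UPOS tags as an illustrative label set.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumption: the "camembert-base" checkpoint can be downloaded from the hub.
tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModelForTokenClassification.from_pretrained("camembert-base", num_labels=17)  # 17 UPOS tags

encoding = tokenizer(["Le", "chat", "dort", "."], is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**encoding).logits  # (1, n_subwords, 17); the head is untrained until fine-tuned
print(logits.argmax(-1))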
Cross-lingual Language Model Pretraining
TLDR
This work proposes two methods to learn cross-lingual language models (XLMs): one unsupervised that only relies on monolingual data, and one supervised that leverages parallel data with a new cross-lingual language model objective.
A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages
TLDR
This work uses the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages and shows that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre
TLDR
The use of a recent neural-network-based lemmatiser and a CRF tagger makes it possible to achieve accuracies beyond the current state of the art on the in-domain test, and proves robust in out-of-domain tests, i.e. up to 20th-century novels.
Improving Lemmatization of Non-Standard Languages with Joint Learning
TLDR
This paper approaches lemmatization as a string-transduction task with an Encoder-Decoder architecture enriched with sentence information from a hierarchical sentence encoder, and shows significant improvements over the state of the art by fine-tuning the sentence encodings to jointly optimize a bidirectional language model loss.
FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP
TLDR
The core idea of the FLAIR framework is to present a simple, unified interface for conceptually very different types of word and document embeddings, which effectively hides all embedding-specific engineering complexity and allows researchers to “mix and match” various embeddings with little effort.
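A minimal sketch of that “mix and match” interface, assuming the pretrained French embeddings shipped with FLAIR ('fr', 'fr-forward', 'fr-backward') can be downloaded:

from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings

# Stack conceptually different embeddings behind one interface.
stacked = StackedEmbeddings([
    WordEmbeddings("fr"),            # static French word embeddings
    FlairEmbeddings("fr-forward"),   # contextual character-level LM, forward direction
    FlairEmbeddings("fr-backward"),  # contextual character-level LM, backward direction
])

sentence = Sentence("Le chat dort.")
stacked.embed(sentence)  # one call embeds the sentence with all stacked models
for token in sentence:
    print(token.text, token.embedding.shape)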
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜
TLDR
Recommendations are provided, including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, and carrying out pre-development exercises that evaluate how the planned approach fits into research and development goals and supports stakeholder values.
Latin BERT: A Contextual Language Model for Classical Philology
TLDR
It is shown that Latin BERT achieves a new state of the art for part-of-speech tagging on all three Universal Dependency datasets for Latin and can be used for predicting missing text (including critical emendations) and for semantically-informed search by querying contextual nearest neighbors.
Lexically Aware Semi-Supervised Learning for OCR Post-Correction
TLDR
This paper introduces a lexically aware decoding method that augments the neural post-correction model with a count-based language model constructed from the recognized texts, implemented using weighted finite-state automata (WFSA) for efficient and effective decoding.