CamemBERT: a Tasty French Language Model
This paper investigates the feasibility of training monolingual Transformer-based language models for languages other than English, taking French as an example and evaluating the resulting models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks.
Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures
A general, highly parallel, multithreaded pipeline to clean and classify Common Crawl by language is proposed and developed so that it runs efficiently on medium to low resource infrastructures where I/O speeds are the main constraint.
Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell
This is the first time that enough unlabeled and annotated data is provided for an emerging user-generated content dialectal language with rich morphology and code switching, making it a challenging test-bed for most recent NLP approaches.
A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages
This work uses the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages and shows that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.
Sinful Adaptation of SotA models for Named Entity Recognition in Historical French and German Newspapers
In this article we present the approaches developed by the Sorbonne-INRIA for NER (SinNer) team for the CLEF-HIPE 2020 challenge on Named Entity Processing on old newspapers. The challenge proposed
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
This work manually audits the quality of 205 language-specific corpora released with five major public datasets, audits the correctness of language codes in a sixth, and finds that lower-resource corpora have systematic issues.
Establishing a New State-of-the-Art for French Named Entity Recognition
The French TreeBank, the main source of morphosyntactic and syntactic annotations for French, is manually annotated with explicit information related to named entities, after an automatic pre-annotation step.
How OCR Performance can Impact on the Automatic Extraction of Dictionary Content Structures
Les modèles de langue contextuels Camembert pour le français : impact de la taille et de l’hétérogénéité des données d’entrainement (CamemBERT Contextual Language Models for French: Impact of
Contextual neural language models are now ubiquitous in natural language processing. Until recently, most available models were trained either on
SinNer@Clef-Hipe2020 : Sinful adaptation of SotA models for Named Entity Recognition in French and German
It is shown that combining several word representations enhances the quality of the results for all NE types and that the segmentation into sentences has an important impact on the results.