• Publications
CamemBERT: a Tasty French Language Model
TLDR
We train a monolingual Transformer-based language model on the French language using recent large-scale corpora.
  • 121
  • 25
Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures
TLDR
We propose a general, highly parallel, multithreaded pipeline to clean and classify Common Crawl by language; we specifically design it so that it runs efficiently on medium to low resource infrastructures where I/O speeds are the main constraint.
  • 54
  • 6
Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell
TLDR
We introduce the first treebank for a romanized user-generated content variety of Algerian, a North-African Arabic dialect known for its frequent usage of code-switching.
  • 7
  • 1
A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages
TLDR
We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages.
  • 18
Establishing a New State-of-the-Art for French Named Entity Recognition
TLDR
The French TreeBank developed at the University Paris 7 is the main source of morphosyntactic and syntactic annotations for French.
  • 3
How OCR Performance can Impact on the Automatic Extraction of Dictionary Content Structures
Les modèles de langue contextuels Camembert pour le français : impact de la taille et de l'hétérogénéité des données d'entrainement (CamemBERT Contextual Language Models for French: Impact of Training Data Size and Heterogeneity)
TLDR
Contextual neural language models are now ubiquitous in natural language processing.
Les modèles de langue contextuels Camembert pour le français : impact de la taille et de l'hétérogénéité des données d'entrainement (Camembert Contextual Language Models for French: Impact of Training Data Size and Heterogeneity)
Contextual neural language models are now ubiquitous in natural language processing. Until recently, most available models were trained either on …
Sinful Adaptation of SotA models for Named Entity Recognition in Historical French and German Newspapers
In this article we present the approaches developed by the Sorbonne-INRIA for NER (SinNer) team for the CLEF-HIPE 2020 challenge on Named Entity Processing on old newspapers. The challenge proposed …
SinNer@Clef-Hipe2020 : Sinful adaptation of SotA models for Named Entity Recognition in French and German
TLDR
In this article we present the approaches developed by the Sorbonne-INRIA for NER (SinNer) team for the CLEF-HIPE 2020 challenge on Named Entity Processing on old newspapers.