Corpus ID: 209439401

BERTje: A Dutch BERT Model

@article{Vries2019BERTjeAD,
  title={BERTje: A Dutch BERT Model},
  author={Wietse de Vries and Andreas van Cranenburgh and Arianna Bisazza and Tommaso Caselli and Gertjan van Noord and Malvina Nissim},
  journal={ArXiv},
  year={2019},
  volume={abs/1912.09582}
}
The transformer-based pre-trained language model BERT has helped to improve state-of-the-art performance on many natural language processing (NLP) tasks. Using the same architecture and parameters, we developed and evaluated a monolingual Dutch BERT model called BERTje. Compared to the multilingual BERT model, which includes Dutch but is only based on Wikipedia text, BERTje is based on a large and diverse dataset of 2.4 billion tokens. BERTje consistently outperforms the equally-sized… 
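As a practical illustration of the kind of monolingual model described above, the following minimal sketch loads BERTje with the Hugging Face transformers library; the model identifier GroNLP/bert-base-dutch-cased is the one under which BERTje is commonly distributed and is assumed here rather than taken from the abstract.

```python
from transformers import AutoTokenizer, AutoModel

# Load BERTje from the Hugging Face hub; the identifier below is the one
# under which the model is commonly distributed (an assumption, not stated
# in the abstract above).
tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
model = AutoModel.from_pretrained("GroNLP/bert-base-dutch-cased")

# Encode a Dutch sentence and inspect the contextual token representations.
inputs = tokenizer("BERTje is een Nederlands taalmodel.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```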

Citations

BERTimbau: Pretrained BERT Models for Brazilian Portuguese
TLDR
This work trains BERT (Bidirectional Encoder Representations from Transformers) models for Brazilian Portuguese, nicknamed BERTimbau, and evaluates them on three downstream NLP tasks: sentence textual similarity, recognizing textual entailment, and named entity recognition.
gaBERT - an Irish Language Model
TLDR
This work introduces gaBERT, a monolingual BERT model for the Irish language, compares it to multilingual BERT, and shows that gaBERT provides better representations for a downstream parsing task.
RoBERT – A Romanian BERT Model
TLDR
This paper introduces a Romanian-only pre-trained BERT model, RoBERT, and compares it with different multilingual models on seven Romanian-specific NLP tasks grouped into three categories: sentiment analysis, dialect and cross-dialect topic identification, and diacritics restoration.
What the [MASK]? Making Sense of Language-Specific BERT Models
TLDR
This paper presents the current state of the art in language-specific BERT models, providing an overall picture with respect to different dimensions (i.e., architectures, data domains, and tasks) and an immediate, straightforward overview of their commonalities and differences.
Vietnamese Question Answering System from Multilingual BERT Models to Monolingual BERT Model
TLDR
This work shows that the monolingual model outperforms the multilingual models and recommends it over a multilingual BERT-based model as the basis for a Vietnamese QA system.
GottBERT: a pure German Language Model
TLDR
GottBERT is a pre-trained German language model based on the original RoBERTa architecture that outperformed all other tested German and multilingual models on Named Entity Recognition (NER) and on the GermEval 2018 and GNAD text classification tasks.
The birth of Romanian BERT
TLDR
Romanian BERT is introduced, the first purely Romanian transformer-based language model, pretrained on a large text corpus; the authors open-source not only the model itself but also a repository with information on how to obtain the corpus, fine-tune the model, and use it in production.
Pre-training Polish Transformer-based Language Models at Scale
TLDR
This study presents two language models for Polish based on the popular BERT architecture, one of which was trained on a dataset of over 1 billion Polish sentences (135 GB of raw text), and describes the methodology for collecting the data, preparing the corpus, and pre-training the model.
Overview of the Transformer-based Models for NLP Tasks
TLDR
This paper provides an overview and explanation of the latest auto-regressive models such as GPT, GPT-2, and XLNet, as well as auto-encoder architectures such as BERT and many post-BERT models like RoBERTa, ALBERT, and ERNIE 1.0/2.0.
LaoPLM: Pre-trained Language Models for Lao
TLDR
This work presents the first transformer-based pre-trained language models (PTMs) for Lao, a language under-represented in NLP research, and constructs a text classification dataset to alleviate the language's resource-scarce situation.

References

Showing 1-10 of 22 references
Multilingual is not enough: BERT for Finnish
TLDR
While the multilingual model largely fails to reach the performance of previously proposed methods, the custom Finnish BERT model establishes new state-of-the-art results on all corpora for all reference tasks: part-of-speech tagging, named entity recognition, and dependency parsing.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks (a minimal sketch of this single-output-layer fine-tuning setup appears after this reference list).
FlauBERT: Unsupervised Language Model Pre-training for French
TLDR
This paper introduces and shares FlauBERT, a model learned on a very large and heterogeneous French corpus, applies it to diverse NLP tasks, and shows that it outperforms other pre-training approaches most of the time.
BERT Rediscovers the Classical NLP Pipeline
TLDR
This work finds that the model represents the steps of the traditional NLP pipeline in an interpretable and localizable way, and that the regions responsible for each step appear in the expected sequence: POS tagging, parsing, NER, semantic roles, then coreference.
Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT
TLDR
This paper explores the broader cross-lingual potential of mBERT (multilingual BERT) as a zero-shot language transfer model on 5 NLP tasks covering a total of 39 languages from various language families: NLI, document classification, NER, POS tagging, and dependency parsing.
AlBERTo: Italian BERT Language Understanding Model for NLP Challenging Tasks Based on Tweets
TLDR
A BERT language understanding model for the Italian language (AlBERTo) is trained, focused on the language used in social networks, specifically on Twitter, obtaining state-of-the-art results in subjectivity, polarity, and irony detection on Italian tweets.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
TLDR
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence (a sketch of one such technique, factorized embedding parameterization, appears after this reference list).
The merits of Universal Language Model Fine-tuning for Small Datasets - a case with Dutch book reviews
TLDR
It is found that ULMFiT outperforms SVM for all training set sizes and that satisfactory results (~90%) can be achieved using training sets that can be manually annotated within a few hours.
Deep Contextualized Word Representations
TLDR
A new type of deep contextualized word representation is introduced that models both complex characteristics of word use and how these uses vary across linguistic contexts, allowing downstream models to mix different types of semi-supervision signals.
What Does BERT Learn about the Structure of Language?
TLDR
This work provides novel support for the possibility that BERT networks capture structural information about language by performing a series of experiments to unpack the elements of English language structure learned by BERT.
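The BERT reference above notes that the pre-trained encoder can be fine-tuned with just one additional output layer. The sketch below illustrates that setup with the Hugging Face transformers library; the checkpoint name and the two-label task are assumptions for illustration, not details from the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load a pre-trained BERT encoder and attach a single, randomly initialised
# classification head (the "one additional output layer"). The checkpoint
# and the two-label task are assumptions chosen for illustration.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=2
)

batch = tokenizer(["A short example sentence."], return_tensors="pt")
labels = torch.tensor([1])

outputs = model(**batch, labels=labels)  # cross-entropy loss over the 2 labels
outputs.loss.backward()                  # gradients flow into head and encoder
```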
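One of the parameter-reduction techniques mentioned in the ALBERT reference is factorized embedding parameterization, which keeps the vocabulary embeddings in a small dimension and projects them up to the Transformer's hidden size. The following is a minimal PyTorch sketch of that idea with illustrative sizes, not the authors' implementation or exact configuration.

```python
import torch
import torch.nn as nn

# Factorized embedding parameterization (ALBERT-style): the vocabulary is
# embedded into a small dimension E and projected up to the hidden size H,
# so the embedding parameters cost roughly V*E + E*H instead of V*H.
V, E, H = 30_000, 128, 768   # illustrative sizes, not the paper's exact config

embed = nn.Embedding(V, E)   # V*E parameters
project = nn.Linear(E, H)    # E*H (+ H bias) parameters

token_ids = torch.randint(0, V, (1, 16))   # a dummy batch of 16 token ids
hidden = project(embed(token_ids))         # shape: (1, 16, H)
print(hidden.shape)
```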