Corpus ID: 236447621

gaBERT — an Irish Language Model

@inproceedings{Barry2021gaBERTA,
  title={gaBERT — an Irish Language Model},
  author={James Barry and Joachim Wagner and Lauren Cassidy and Alan Cowap and Teresa Lynn and Abigail Walsh and Mícheál J. Ó Meachair and Jennifer Foster},
  booktitle={International Conference on Language Resources and Evaluation},
  year={2021}
}
The BERT family of neural language models has become highly popular due to its ability to provide sequences of text with rich, context-sensitive token encodings that generalise well to many NLP tasks. We introduce gaBERT, a monolingual BERT model for the Irish language. We compare our gaBERT model to multilingual BERT and the monolingual Irish WikiBERT, and we show that gaBERT provides better representations for a downstream parsing task. We also show how different filtering… 
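As a rough illustration of how such a monolingual model might be used to obtain context-sensitive token encodings, the sketch below loads a BERT checkpoint with the Hugging Face transformers library. The model identifier is an assumption for illustration only and may not match the released gaBERT checkpoint.

```python
# Minimal sketch: extracting contextual token encodings from a BERT
# checkpoint with Hugging Face transformers. The model identifier below
# is an assumption; substitute the actual released gaBERT checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "DCU-NLP/bert-base-irish-cased-v1"  # assumed identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

sentence = "Is teanga Cheilteach í an Ghaeilge."  # "Irish is a Celtic language."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One context-sensitive vector per subword token: (batch, seq_len, hidden).
token_encodings = outputs.last_hidden_state
print(token_encodings.shape)
```

Per-token vectors of this kind are what a downstream component such as a dependency parser would consume in place of static word embeddings.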

Citations

TwittIrish: A Universal Dependencies Treebank of Tweets in Modern Irish

The first Universal Dependencies treebank of Irish tweets is released, facilitating natural language processing of user-generated content in Irish; the bootstrapping method used for treebank development is described and preliminary parsing experiments are reported.

Use of Transformer-Based Models for Word-Level Transliteration of the Book of the Dean of Lismore

This work outlines the problem of transliterating the text of the BDL into a standardised orthography, and performs exploratory experiments using Transformer-based models for this task.

A BERT’s Eye View: Identification of Irish Multiword Expressions Using Pre-trained Language Models

This paper compares the use of a monolingual BERT model for Irish with multilingual BERT, fine-tuned to perform MWE identification, presenting a series of experiments to explore the impact of hyperparameter tuning and dataset optimisation steps on these models.
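To make the fine-tuning setup concrete, a BERT encoder can be wrapped with a token-classification head and trained on BIO-style MWE tags. The sketch below is a hypothetical illustration: the three-label scheme and the checkpoint name are assumptions, not the configuration used in the paper summarised above.

```python
# Hypothetical sketch: MWE identification framed as token classification
# over a BIO tag set. The label scheme and checkpoint name are assumptions.
from transformers import AutoModelForTokenClassification, AutoTokenizer

LABELS = ["O", "B-MWE", "I-MWE"]
MODEL_ID = "DCU-NLP/bert-base-irish-cased-v1"  # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_ID,
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),
    label2id={label: i for i, label in enumerate(LABELS)},
)

# Fine-tuning would proceed with a standard training loop, aligning
# word-level BIO tags to subword tokens and ignoring padding positions.
```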

Diachronic Parsing of Pre-Standard Irish

A small benchmark corpus, annotated according to the Universal Dependencies guidelines and covering a range of dialects and time periods since 1600, is introduced, and baselines for lemmatization, tagging, and dependency parsing on this corpus are established by experimenting with a variety of machine learning approaches.

References

Showing 1-10 of 55 references

Is Multilingual BERT Fluent in Language Generation?

It is found that the currently available multilingual BERT model is clearly inferior to its monolingual counterparts and in many cases cannot serve as a substitute for a well-trained monolingual model.

BERTje: A Dutch BERT Model

The transformer-based pre-trained language model BERT has helped to improve state-of-the-art performance on many natural language processing (NLP) tasks. A monolingual Dutch BERT model called BERTje is developed and evaluated, and it consistently outperforms the equally-sized multilingual BERT model on downstream NLP tasks.

Multilingual is not enough: BERT for Finnish

While the multilingual model largely fails to reach the performance of previously proposed methods, the custom Finnish BERT model establishes new state-of-the-art results on all corpora for all reference tasks: part-of-speech tagging, named entity recognition, and dependency parsing.

WikiBERT Models: Deep Transfer Learning for Many Languages

A simple, fully automated pipeline for creating language-specific BERT models from Wikipedia data is introduced, along with 42 new such models, most for languages that until now have lacked dedicated deep neural language models.

Parsing with Multilingual BERT, a Small Corpus, and a Small Treebank

Using dependency parsing of four diverse low-resource language varieties as a case study, it is shown that these methods significantly improve performance over baselines, especially in the lowest-resource cases, and the importance of the relationship between such models' pretraining data and the target language varieties is demonstrated.

What the [MASK]? Making Sense of Language-Specific BERT Models

The current state of the art in language-specific BERT models is presented, providing an overall picture with respect to different dimensions (i.e., architectures, data domains, and tasks) and an immediate, straightforward overview of their commonalities and differences.

ParsBERT: Transformer-based Model for Persian Language Understanding

A monolingual BERT for the Persian language (ParsBERT) is proposed, which achieves state-of-the-art performance compared to other architectures and multilingual models.

Finding Universal Grammatical Relations in Multilingual BERT

An unsupervised analysis method is presented that provides evidence that mBERT learns representations of syntactic dependency labels, in the form of clusters which largely agree with the Universal Dependencies taxonomy, suggesting that even without explicit supervision, multilingual masked language models learn certain linguistic universals.

Deep Contextualized Word Representations

A new type of deep contextualized word representation is introduced that models both complex characteristics of word use and how these uses vary across linguistic contexts, allowing downstream models to mix different types of semi-supervision signals.

Are All Languages Created Equal in Multilingual BERT?

This work explores how mBERT performs on a much wider set of languages, focusing on the quality of representation for low-resource languages, measured by within-language performance, and finds that better models for low-resource languages require more efficient pretraining techniques or more data.
...