Corpus ID: 244346583

RoBERTuito: a pre-trained language model for social media text in Spanish

@article{Prez2021RoBERTuitoAP,
  title={RoBERTuito: a pre-trained language model for social media text in Spanish},
  author={Juan Manuel P{\'e}rez and Dami{\'a}n Ariel Furman and Laura Alonso Alemany and Franco Mart{\'i}n Luque},
  journal={ArXiv},
  year={2021},
  volume={abs/2111.09453}
}
Since BERT appeared, Transformer language models and transfer learning have become the state of the art for natural language processing tasks. Recently, several works have pre-trained specially-crafted models for particular domains, such as scientific papers, medical documents, or user-generated text. These domain-specific models have been shown to improve performance significantly on most tasks; however, for languages other than English, such models are not widely available. In… 
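As a minimal, hedged sketch of how a domain-specific checkpoint such as RoBERTuito can be loaded for downstream use (not described in the abstract above), the Python snippet below relies on the Hugging Face transformers library; the model identifier pysentimiento/robertuito-base-uncased and the example text are assumptions made for illustration.

from transformers import AutoTokenizer, AutoModel

# Assumed model id for the RoBERTuito checkpoint; replace it with the id
# actually published by the authors if it differs.
model_name = "pysentimiento/robertuito-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode a short, tweet-like Spanish text and obtain contextual embeddings.
inputs = tokenizer("jajaja qué buen día @usuario", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)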

Citations

ALBETO and DistilBETO: Lightweight Spanish Language Models
TLDR
ALBETO and DistilBETO are presented, versions of ALBERT and DistilBERT pre-trained exclusively on Spanish corpora; despite having fewer parameters, their performance improves on tasks such as Natural Language Inference, Paraphrase Identification, and Named Entity Recognition.
XLM-EMO: Multilingual Emotion Prediction in Social Media Text
TLDR
A multilingual emotion prediction model for social media data, XLM-EMO, shows competitive performance in a zero-shot setting, suggesting it is helpful in the context of low-resource languages.

References

Showing 1-10 of 41 references
Spanish Language Models
TLDR
The Spanish RoBERTa-base and RoBERTa-large models were pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work.
BERTweet: A pre-trained language model for English Tweets
TLDR
BERTweet is presented, the first public large-scale pre-trained language model for English Tweets, trained using the RoBERTa pre-training procedure; it outperforms the previous state-of-the-art models on three Tweet NLP tasks.
AlBERTo: Italian BERT Language Understanding Model for NLP Challenging Tasks Based on Tweets
TLDR
A BERT language understanding model for the Italian language (AlBERTo) is trained, focused on the language used in social networks, specifically Twitter, obtaining state-of-the-art results in subjectivity, polarity, and irony detection on Italian tweets.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
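As a minimal sketch of the fine-tuning recipe summarized above (a pre-trained bidirectional encoder plus one additional output layer), assuming the Hugging Face transformers API and a toy two-label classification setup with hypothetical inputs:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # the single task-specific output layer
)

# Hypothetical labeled examples, for illustration only.
batch = tokenizer(["great movie", "terrible plot"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)  # returns loss and logits
outputs.loss.backward()                  # gradients flow through all encoder layers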
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
TLDR
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.
Are Multilingual Models Effective in Code-Switching?
TLDR
The findings suggest that pre-trained multilingual models do not necessarily guarantee high-quality representations on code-switching, while using meta-embeddings achieves similar results with significantly fewer parameters.
Spanish Pre-Trained BERT Model and Evaluation Data. PML4DC at ICLR
  • 2020
RoBERTa: A Robustly Optimized BERT Pretraining Approach
TLDR
It is found that BERT was significantly undertrained and can match or exceed the performance of every model published after it; the best model achieves state-of-the-art results on GLUE, RACE, and SQuAD.