Corpus ID: 235899361

Spanish Language Models

  title={Spanish Language Models},
  author={Asier Gutiérrez-Fandiño and Jordi Armengol-Estapé and Marc Pàmies and Joan Llop-Palao and Joaquín Silveira-Ocampo and C. Carrino and A. Gonzalez-Agirre and Carme Armentano-Oller and Carlos Rodríguez-Penagos and Marta Villegas},
This paper presents the Spanish RoBERTa-base and RoBERTa-large models, as well as the corresponding performance evaluations. Both models were pre-trained using the largest Spanish corpus known to date, with a total of 570GB of clean and deduplicated text processed for this work, compiled from the web crawlings performed by the National Library of Spain from 2009 to 2019. We extended the current evaluation datasets with an extractive Question Answering dataset and our models outperform the…

Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan
This work builds a clean, high-quality textual Catalan corpus, trains a Transformer-based language model for Catalan, and devises a thorough evaluation across a diversity of settings, to explore to what extent a medium-sized monolingual language model is competitive with state-of-the-art large multilingual models.
SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability
This year, participants were challenged with new datasets in English and Spanish, with the annotations for both subtasks leveraging crowdsourcing, as well as a pilot task on interpretable STS, where systems needed to add an explanatory layer.
Multilingual is not enough: BERT for Finnish
While the multilingual model largely fails to reach the performance of previously proposed methods, the custom Finnish BERT model establishes new state-of-the-art results on all corpora for all reference tasks: part-of-speech tagging, named entity recognition, and dependency parsing.
SemEval-2014 Task 10: Multilingual Semantic Textual Similarity
This year, participants were challenged with new datasets for English, as well as the introduction of Spanish as a new language in which to assess semantic similarity; the annotations for both tasks leveraged crowdsourcing.
A Corpus for Multilingual Document Classification in Eight Languages
A new subset of the Reuters corpus with balanced class priors for eight languages is proposed, adding Italian, Russian, Japanese and Chinese, which provides strong baselines for all language-transfer directions using multilingual word and sentence embeddings, respectively.
XNLI: Evaluating Cross-lingual Sentence Representations
This work constructs an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus to 14 languages, including low-resource languages such as Swahili and Urdu, and finds that XNLI represents a practical and challenging evaluation suite and that directly translating the test data yields the best performance among available baselines.
What the [MASK]? Making Sense of Language-Specific BERT Models
The current state of the art in language-specific BERT models is presented, providing an overall picture with respect to different dimensions (i.e. architectures, data domains, and tasks), and an immediate and straightforward overview of the commonalities and differences is provided.
Language Models are Unsupervised Multitask Learners
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
SQuAD: 100,000+ Questions for Machine Comprehension of Text
A strong logistic regression model is built, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%).
Overview of CAPITEL Shared Tasks at IberLEF 2020: Named Entity Recognition and Universal Dependencies Parsing
The results of the CAPITEL-EVAL shared task, held in the context of the IberLEF 2020 competition series, are presented; the task consists of two subtasks: Named Entity Recognition and Classification, and Universal Dependencies parsing.