Corpus ID: 21724921

The brWaC Corpus: A New Open Resource for Brazilian Portuguese

@inproceedings{Wagner2018TheBC,
  title={The brWaC Corpus: A New Open Resource for Brazilian Portuguese},
  author={Jorge Wagner and Rodrigo Wilkens and Marco A P Idiart and Aline Villavicencio},
  booktitle={LREC},
  year={2018}
}
In this work, we present the construction process of a large Web corpus for Brazilian Portuguese, aiming to achieve a size comparable to the state of the art in other languages. We also discuss our updated sentence-level approach for the strict removal of duplicated content. Following the pipeline methodology, more than 60 million pages were crawled and filtered, with 3.5 million being selected. The resulting multi-domain corpus, named brWaC, is composed of 2.7 billion tokens, and has been… 
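The sentence-level deduplication step mentioned in the abstract could be sketched roughly as follows. This is a minimal illustration only, not the authors' actual pipeline: the hashing scheme, normalization, and document structure are all assumptions made for the example.

```python
import hashlib

def dedup_sentences(documents):
    """Keep only the first occurrence of each sentence across a corpus.

    `documents` is an iterable of documents, each a list of sentence
    strings. Sentences are normalized (stripped, lowercased) and hashed
    so the seen-set stays compact even for billions of tokens. This is
    a hypothetical sketch, not the brWaC implementation.
    """
    seen = set()
    deduplicated = []
    for doc in documents:
        kept = []
        for sentence in doc:
            key = hashlib.sha1(
                sentence.strip().lower().encode("utf-8")
            ).hexdigest()
            if key not in seen:
                seen.add(key)
                kept.append(sentence)
        if kept:  # drop documents emptied by deduplication
            deduplicated.append(kept)
    return deduplicated

docs = [["A frase um.", "A frase dois."],
        ["A frase dois.", "A frase três."]]
print(dedup_sentences(docs))
# the second document keeps only the sentence not seen before
```

Storing fixed-size hashes rather than the sentences themselves is one common way to keep memory bounded at corpus scale; a production pipeline would likely shard the seen-set or use probabilistic structures such as Bloom filters.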

Citations of this paper

PassPort: A Dependency Parsing Model for Portuguese
TLDR
PassPort, a model for the dependency parsing of Portuguese trained with the Stanford Parser, is introduced; it achieves results very similar to PALAVRAS, with a LAS of 85.02 for PassPort against 84.36 for PALAVRAS.
Corpus-based Methodology for an Online Multilingual Collocations Dictionary: First Steps
This paper describes the first steps of a corpus-based methodology for the development of an online Platform for Multilingual Collocations Dictionaries (PLATCOL). The platform is aimed to be
DEEPAGÉ: Answering Questions in Portuguese about the Brazilian Environment
TLDR
This work introduces multiple QA systems that combine in novel ways the BM25 algorithm, a sparse retrieval technique, with PTT5, a pre-trained state-of-the-art language model, focusing on the Portuguese language, thus offering resources not found elsewhere in the literature.
Semantic Role Labeling in Portuguese: Improving the State of the Art with Transfer Learning and BERT-based Models
Semantic role labeling is the natural language processing task of determining "Who did what to whom", "when", "where", "how", etc. In this thesis, we explored state of the art techniques for this
Word Embedding Evaluation in Downstream Tasks and Semantic Analogies
TLDR
The results show that a diverse and comprehensive corpus can often outperform a larger, less textually diverse corpus, and that batch training may cause quality loss in the authors' models.
BERT for Sentiment Analysis: Pre-trained and Fine-Tuned Alternatives
TLDR
The experiments include BERT models trained with Brazilian Portuguese corpora and the multilingual version, contemplating multiple aggregation strategies and open-source datasets with predefined training, validation, and test partitions to facilitate the reproducibility of the results.
Building Web Corpora for Minority Languages
TLDR
A strategy for collecting textual material from the Internet for minority languages using web crawling combined with a language identification system and crowdsourcing before making sentence corpora out of the downloaded texts is described.
Assessing the Impact of Contextual Embeddings for Portuguese Named Entity Recognition
TLDR
The best NER system outperforms the state-of-the-art in Portuguese NER by 5.99 in absolute percentage points and a comparative study of 16 different combinations of shallow and contextual embeddings is shown.
BERTimbau: Pretrained BERT Models for Brazilian Portuguese
TLDR
This work trains BERT (Bidirectional Encoder Representations from Transformers) models for Brazilian Portuguese, nicknamed BERTimbau, and evaluates them on three downstream NLP tasks: sentence textual similarity, recognizing textual entailment, and named entity recognition.
NILC-Metrix: assessing the complexity of written and spoken language in Brazilian Portuguese
TLDR
The potential of NILC-Metrix is illustrated by presenting three applications: a descriptive analysis of the differences between children's film subtitles and texts written for Elementary School I and II (Final Years); a new predictor of textual complexity for the corpus of original and simplified texts of the PorSimples project; and a complexity prediction model for school grades, using transcripts of children's story narratives told by teenagers.

References

Showing 1–10 of 28 references
brWaC: A WaCky Corpus for Brazilian Portuguese
TLDR
The ongoing work on building brWaC, a massive Brazilian Portuguese corpus crawled from .br domains is presented, resulting in a tokenized and lemmatized corpus of 3 billion words.
Automatic Construction of Large Readability Corpora
TLDR
A framework for the automatic construction of large Web corpora classified by readability level is presented, including 1.7 million documents and about 1.6 billion tokens, already parsed and annotated with 134 different textual attributes, along with the agreement among the various classifiers.
The WaCky wide web: a collection of very large linguistically processed web-crawled corpora
TLDR
UkWaC, deWaC and itWaC are introduced, three very large corpora of English, German, and Italian built by web crawling, and the methodology and tools used in their construction are described.
Crawling by Readability Level
TLDR
A framework for automatic generation of large corpora classified by readability is proposed, which adopts a supervised learning method to incorporate a readability filter, based on features with low computational cost, into a crawler, to collect texts targeted at a specific reading level.
The TenTen Corpus Family
Everyone working on general language would like their corpus to be bigger, wider-coverage, cleaner, duplicate-free, and with richer metadata. In this paper we describe our programme to build ever
Efficient corpus development for lexicography: building the New Corpus for Ireland
TLDR
A new, register-diverse, 55-million-word bilingual corpus—the New Corpus for Ireland (NCI)—is developed to support the creation of a new English-to-Irish dictionary, and it is believed to have a good model for corpus creation for lexicography.
LX-DSemVectors: Distributional Semantics Models for Portuguese
TLDR
The creation and distribution of the first publicly available word embeddings for Portuguese are described and evaluated on their own and also compared with the original English models on a well-known analogy task.
Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation
TLDR
This work proposes a simple solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages using a shared wordpiece vocabulary, and introduces an artificial token at the beginning of the input sentence to specify the required target language.
Clustering and Diversifying Web Search Results with Graph-Based Word Sense Induction
TLDR
Key to the approach is to first acquire the various senses of an ambiguous query and then cluster the search results based on their semantic similarity to the word senses induced, which outperforms both Web clustering and search engines.
A WaCky Introduction
We use the Web today for a myriad purposes, from buying a plane ticket to browsing an ancient manuscript, from looking up a recipe to watching a TV program. And more. Besides these “proper” uses,