The brWaC Corpus: A New Open Resource for Brazilian Portuguese
In this work, we present the construction process of a large Web corpus for Brazilian Portuguese, aiming to achieve a size comparable to the state of the art in other languages. We also discuss our
B2SG: a TOEFL-like Task for Portuguese
The BabelNet-Based Semantic Gold Standard (B2SG) was automatically constructed from BabelNet and partly evaluated by human judges. It can be used as the basis for evaluating the accuracy of the similarity relations in distributional thesauri.
Size Does Not Matter. Frequency Does. A Study of Features for Measuring Lexical Complexity
Interestingly, the results show that word length is not important, while corpus frequency is enough to correctly classify a large proportion of the test cases (F-measure over 80%).
Using NLP for Enhancing Second Language Acquisition
This study presents SMILLE, a system that draws on the Noticing Hypothesis and on input enhancements, addressing the lack of salience of grammatical information in online documents chosen by a given
Automatic Construction of Large Readability Corpora
A framework for the automatic construction of large Web corpora classified by readability level is presented. The resulting resource includes 1.7 million documents and about 1.6 billion tokens, already parsed and annotated with 134 different textual attributes, along with the agreement among the various classifiers.
Enhancing Grammatical Structures in Web-Based Texts
The SMILLE system is presented. It uses Natural Language Processing to enhance grammatical information in texts chosen by a given user, and is designed to draw users’ attention to specific grammatical structures and thus help them notice their occurrence in authentic contexts.
PassPort: A Dependency Parsing Model for Portuguese
PassPort, a model for the dependency parsing of Portuguese trained with the Stanford Parser, is introduced. It achieved results very similar to those of PALAVRAS, with a labeled attachment score (LAS) of 85.02 for PassPort against 84.36 for PALAVRAS.
An SLA Corpus Annotated with Pedagogically Relevant Grammatical Structures
This study automatically annotated a corpus of texts produced by language learners with pedagogically relevant grammatical structures and observed how these structures are employed by learners at different proficiency levels.
Crawling by Readability Level
A framework for the automatic generation of large corpora classified by readability is proposed. It adopts a supervised learning method to incorporate a readability filter, based on features with low computational cost, into a crawler, so as to collect texts targeted at a specific reading level.
Coreference-Based Text Simplification
This paper presents a rule-based system for automatic text simplification that aims to adapt French texts for dyslexic children. The system takes into account not only lexical and syntactic but also discourse information, based on coreference chains.