Corpus ID: 9831081

Automatic Construction of Large Readability Corpora

Jorge Wagner, Rodrigo Wilkens, Aline Villavicencio
Published in CL4LC@COLING 2016
This work presents a framework for the automatic construction of large Web corpora classified by readability level. We compare different machine learning classifiers for the task of readability assessment, focusing on Portuguese and English texts, and analyse the impact that variables such as the feature inventory have on the resulting corpus. In a comparison between shallow and deeper features, the former already produce F-measures of over 0.75 for Portuguese texts, but the use of additional features…
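As a rough illustration of what "shallow" features can look like, the sketch below computes three low-cost surface statistics from raw text. The function name `shallow_features` and the three features chosen are assumptions made for illustration; the abstract does not list the paper's actual feature inventory, and the naive sentence splitter here is not the authors' preprocessing.

```python
import re

def shallow_features(text: str) -> dict:
    """Compute a few low-cost surface features (illustrative, hypothetical set)."""
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Words are runs of letters; Unicode-aware, so accented Portuguese letters count.
    words = re.findall(r"[^\W\d_]+", text.lower())
    # Guard against division by zero on empty input.
    n_sent = max(len(sentences), 1)
    n_words = max(len(words), 1)
    return {
        "words_per_sentence": len(words) / n_sent,
        "chars_per_word": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(set(words)) / n_words,
    }

features = shallow_features("O gato dorme. O gato come peixe.")
```

Feature vectors of this kind could then be passed to any standard classifier; which classifiers and features work best is exactly what the paper evaluates.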


The brWaC Corpus: A New Open Resource for Brazilian Portuguese
In this work, we present the construction process of a large Web corpus for Brazilian Portuguese, aiming to achieve a size comparable to the state of the art in other languages. We also discuss our
PassPort: A Dependency Parsing Model for Portuguese
PassPort, a model for the dependency parsing of Portuguese trained with the Stanford Parser, is introduced; its results are very similar to those of PALAVRAS, with a LAS of 85.02 for PassPort against 84.36 for PALAVRAS.
What if the whole is greater than the sum of the parts? Modelling Complex (Multiword) Expressions (invited paper)
Results are presented for contextualised word representation models, which have been successfully used to capture different word usages and could therefore provide an attractive alternative for representing idiomaticity in language.


Crawling by Readability Level
A framework for the automatic generation of large corpora classified by readability is proposed. It uses a supervised learning method to incorporate a readability filter, based on features with low computational cost, into a crawler, so that the crawler collects texts targeted at a specific reading level.
The WaCky wide web: a collection of very large linguistically processed web-crawled corpora
ukWaC, deWaC and itWaC, three very large corpora of English, German, and Italian built by web crawling, are introduced, and the methodology and tools used in their construction are described.
Revisiting the Readability Assessment of Texts in Portuguese
This paper presents experiments to build a readability checker to classify texts in Portuguese, considering different text genres, domains and reader ages, using naturally occurring texts.
Readability Classification for German using Lexical, Syntactic, and Morphological Features
It is shown that readability classification for German based on syntactic, lexical and language model features from previous research on English is highly successful, reaching 89.7% accuracy, with the new morphological features making an important contribution.
Simple or Complex? Assessing the readability of Basque Texts
ErreXail, a readability assessment system for Basque, is presented. It is intended as the preprocessing module of a text simplification system, and the best-performing and most predictive features are identified.
An open-source rule-based syllabification tool for Brazilian Portuguese
The proposed tool is based on published rule-based algorithms, with some new proposals, especially in the treatment of words with diphthongs and hiatus, and it correctly syllabifies 99% of words.
brWaC: A WaCky Corpus for Brazilian Portuguese
The ongoing work on building brWaC, a massive Brazilian Portuguese corpus crawled from .br domains, is presented, resulting in a tokenized and lemmatized corpus of 3 billion words.
A comparative study of classifier combination applied to NLP tasks
This study explores the performance of a number of combination methods, such as voting, Bayesian merging, behavior knowledge space, bagging, stacking, feature sub-spacing and cascading, on the part-of-speech tagging task using nine corpora in five languages. The authors believe it is the most exhaustive comparison of combination methods applied to NLP tasks so far.
Do NLP and machine learning improve traditional readability formulas?
This paper compares an emerging paradigm which uses sophisticated NLP-enabled features and machine learning techniques with an existing approach to readability formulas, finding the new readability formula performed better than the "classic" formula.
A machine learning approach to reading level assessment
This paper uses support vector machines to combine features from n-gram language models, parses, and traditional reading level measures to produce a better method of assessing reading level, and explores ways that multiple human annotations can be used in comparative assessments of system performance.