Crawling by Readability Level

Jorge Wagner, Rodrigo Wilkens, Leonardo Zilio, Marco A. P. Idiart and Aline Villavicencio
The availability of annotated corpora for research in the area of Readability Assessment is still very limited. On the other hand, the Web is increasingly being used by researchers as a source of written content to build very large and rich corpora, under the Web as Corpus (WaC) initiative. This paper proposes a framework for the automatic generation of large corpora classified by readability. It adopts a supervised learning method to incorporate a readability filter based on features with low…
Automatic Construction of Large Readability Corpora
A framework for the automatic construction of large Web corpora classified by readability level is presented, including 1.7 million documents and about 1.6 billion tokens, already parsed and annotated with 134 different textual attributes, along with the agreement among the various classifiers.
The brWaC Corpus: A New Open Resource for Brazilian Portuguese
In this work, we present the construction process of a large Web corpus for Brazilian Portuguese, aiming to achieve a size comparable to the state of the art in other languages. We also discuss our…
Text complexity of open educational resources in Portuguese: mixing written and spoken registers in a multi-task approach
This paper presents a study on text complexity of Open Educational Resources (OER) in Brazilian Portuguese. In a data analysis of the Brazilian Ministry of Education Integrated Platform (MEC-RED)…
CM2News: Towards a Corpus for Multilingual Multi-Document Summarization
This paper describes the ongoing construction of CM2News, a semantically annotated corpus for fostering research on multilingual multi-document summarization. The corpus comprises 20 clusters of news…
PassPort: A Dependency Parsing Model for Portuguese
PassPort, a model for the dependency parsing of Portuguese trained with the Stanford Parser, is introduced; it achieves results very similar to PALAVRAS, with a LAS of 85.02 for PassPort against 84.36 for PALAVRAS.
LexSubNC: A Dataset of Lexical Substitution for Nominal Compounds
A lexical substitution dataset for Portuguese nominal compounds is presented, in which a significant effect of compositionality is found in the use of one of the component words (head or modifier) as a substitute.
Cross-Lingual Induction and Transfer of Verb Classes Based on Word Vector Space Specialisation
This work proposes a novel cross-lingual transfer method for inducing VerbNets for multiple languages, and is the first study which demonstrates how the architectures for learning word embeddings can be applied to this challenging syntactic-semantic task.
A Lexical Simplification Tool for Promoting Health Literacy
An authoring tool that combines Natural Language Processing, Corpus Linguistics and Terminology to help writers to convert health-related information into a more accessible version for people with low literacy skills is presented.


On The Applicability of Readability Models to Web Texts
Applying readability models and their underlying features to web search results shows that the average reading level of the retrieved web documents is relatively high, supporting the potential usefulness of readability ranking for the web.
The WaCky wide web: a collection of very large linguistically processed web-crawled corpora
UkWaC, deWaC and itWaC are introduced, three very large corpora of English, German, and Italian built by web crawling, and the methodology and tools used in their construction are described.
brWaC: A WaCky Corpus for Brazilian Portuguese
The ongoing work on building brWaC, a massive Brazilian Portuguese corpus crawled from .br domains is presented, resulting in a tokenized and lemmatized corpus of 3 billion words.
Revisiting the Readability Assessment of Texts in Portuguese
This paper presents experiments to build a readability checker to classify texts in Portuguese, considering different text genres, domains and reader ages, using naturally occurring texts.
A machine learning approach to reading level assessment
This paper uses support vector machines to combine features from n-gram language models, parses, and traditional reading level measures to produce a better method of assessing reading level, and explores ways that multiple human annotations can be used in comparative assessments of system performance.
A Comparison of Features for Automatic Readability Assessment
It is found that features based on in-domain language models have the highest predictive power, and that entity density and POS features, in particular nouns, are individually very useful but highly correlated.
Do NLP and machine learning improve traditional readability formulas?
This paper compares an emerging paradigm, which uses sophisticated NLP-enabled features and machine learning techniques, with the existing approach of readability formulas, finding that the new readability formula performs better than the "classic" formula.
Coh-Metrix: Analysis of text on cohesion and language
Standard text readability formulas scale texts on difficulty by relying on word length and sentence length, whereas Coh-Metrix is sensitive to cohesion relations, world knowledge, and language and discourse characteristics.
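As an illustration of the surface-level measures that these classic formulas rely on (word length and sentence length), the following is a minimal sketch of the Flesch-Kincaid grade-level formula. The function names and the naive vowel-group syllable counter are illustrative assumptions, not part of any of the tools above:

```python
import re

def count_syllables(word: str) -> int:
    # Illustrative heuristic only: approximate syllables by counting
    # groups of consecutive vowels (treating "y" as a vowel).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    # Classic formula: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) \
        + 11.8 * syllables / len(words) - 15.59
```

Because such formulas depend only on these two length ratios, they are insensitive to the cohesion and discourse characteristics that Coh-Metrix and the NLP-based approaches above are designed to capture.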
An open-source rule-based syllabification tool for Brazilian Portuguese
The proposed tool is based on published rule-based algorithms, with some new proposals, especially in the treatment of words with diphthongs and hiatuses, and correctly syllabifies 99% of words.
BabelNet: Building a Very Large Multilingual Semantic Network
A very large, wide-coverage multilingual semantic network that integrates lexicographic and encyclopedic knowledge from WordNet and Wikipedia is presented; machine translation is also applied to enrich the resource with lexical information for all languages.