Learn More
We present a corpus-based approach to the class expansion task. For a given set of seed entities we use co-occurrence statistics taken from a text collection to define a membership function that is used to rank candidate entities for inclusion in the set. We describe an evaluation framework that uses data from Wikipedia. The performance of our class(More)
The automatic processing of microblogging messages may be problematic, even in the case of very elementary operations such as tokenization. The problems arise from the use of non-standard language, including media-specific words (e.g. "2day", "gr8", "tl;dr", "loool"), emoticons (e.g. "(ò_ó)", "(=^-^=)"), non-standard letter casing (e.g. "dr.(More)
Modern social network analysis relies on vast quantities of data to infer new knowledge about human relations and communication. In this paper we describe TwitterEcho, an open source Twitter crawler for supporting this kind of research, which is characterized by a modular distributed architecture. Our crawler enables researchers to continuously collect data(More)
We present a multi-pass clustering approach to large scale, wide-scope named-entity disambiguation (NED) on collections of web pages. Our approach uses name co-occurrence information to cluster and hence disambiguate entities, and is designed to handle NED on the entire web. We show that on web collections , NED becomes increasingly difficult as the corpus(More)
In this paper we describe REPENTINO, a publicly available gazetteer intended to help the development of named entity recognition systems for Portuguese. REPENTINO wishes to minimize the problems developers face due to the limited availability of this type of lexical-semantic resources for Portuguese. The data stored in REPENTINO was mostly extracted from(More)
In this paper we present RAMA (Relational Artist MAps), a simple yet efficient interface to navigate through networks of music artists. RAMA is built upon a dataset of artist similarity and user-defined tags regarding 583.000 artists gathered from Last.fm. This third-party, publicly available, data about artists similarity and artists tags is used to(More)
In this Late-breaking/Demo submission, we present a completely revamped version of the RAMA interactive artist network visualization prototype. This version keeps most features from previous versions, and additionally to an improved graphical design 1 and an improved compatibility with diverse Operating Systems and web browsers, it adds a number of features(More)
It is difficult to determine the country of origin of the author of a short message based only on the text. This is an even more complex problem when more than one country uses the same native language. In this paper, we address the specific problem of detecting the two main variants of the Portuguese language --- European and Brazilian --- in Twitter(More)