Luís Sarmento

We present a corpus-based approach to the class expansion task. For a given set of seed entities we use co-occurrence statistics taken from a text collection to define a membership function that is used to rank candidate entities for inclusion in the set. We describe an evaluation framework that uses data from Wikipedia. The performance of our class …
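The abstract above does not say which membership function is used, so the following is only a minimal sketch of the general idea: rank candidates by how strongly they co-occur with the seeds. The Jaccard-style score, the function names, and the toy documents are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, per entity and per unordered entity pair, the documents they appear in."""
    pair_counts = Counter()
    entity_counts = Counter()
    for entities in documents:
        unique = sorted(set(entities))
        entity_counts.update(unique)
        # Pairs are stored as sorted tuples so lookups are order-independent.
        pair_counts.update(combinations(unique, 2))
    return pair_counts, entity_counts


def membership(candidate, seeds, pair_counts, entity_counts):
    """Hypothetical membership score: mean Jaccard overlap between candidate and seeds."""
    total = 0.0
    for seed in seeds:
        both = pair_counts[tuple(sorted((candidate, seed)))]
        either = entity_counts[candidate] + entity_counts[seed] - both
        total += both / either if either else 0.0
    return total / len(seeds)  # assumes a non-empty seed list


def rank_candidates(seeds, documents):
    """Return all non-seed entities, best candidate first (ties broken by name)."""
    pair_counts, entity_counts = cooccurrence_counts(documents)
    seed_set = set(seeds)
    scored = [
        (membership(entity, seeds, pair_counts, entity_counts), entity)
        for entity in entity_counts
        if entity not in seed_set
    ]
    return [entity for score, entity in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Toy collection: each document is reduced to the entities it mentions.
documents = [
    ["Lisbon", "Porto", "Braga"],
    ["Lisbon", "Porto"],
    ["Porto", "Braga", "banana"],
    ["banana", "apple"],
]
ranking = rank_candidates(["Lisbon", "Porto"], documents)
```

With the seeds "Lisbon" and "Porto", the entity that shares the most documents with them is ranked first, and entities that never co-occur with a seed fall to the bottom.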
We present a multi-pass clustering approach to large scale, wide-scope named-entity disambiguation (NED) on collections of web pages. Our approach uses name co-occurrence information to cluster and hence disambiguate entities, and is designed to handle NED on the entire web. We show that on web collections, NED becomes increasingly difficult as the corpus …
We present a new method for automatically enlarging a sentiment lexicon for mining social judgments from text, i.e., extracting opinions about human subjects. We use a two-step approach: first, we find which adjectives can be used as human modifiers, and then we assign their polarity attribute. To identify the human modifiers, we developed a set of …
• Rule sets tend to grow exceedingly, especially if:
  – the NER task involves more than the “traditional” entities. Ex: “9th Symphony”, “Alzheimer’s disease”
  – the text to be analyzed is not just “well-behaved” newspaper text. Ex: web-pages, blogs, etc.
• Developers usually end up with a large rule set
  – difficult to maintain
  – difficult to debug
  – difficult …
The automatic processing of microblogging messages may be problematic, even in the case of very elementary operations such as tokenization. The problems arise from the use of non-standard language, including media-specific words (e.g. "2day", "gr8", "tl;dr", "loool"), emoticons (e.g. "(ò_ó)", "(=^-^=)"), non-standard letter casing (e.g. "dr. …
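To make the tokenization problem concrete, here is a small regex-based sketch that keeps some of the token types named above in one piece. It is a hand-written heuristic for illustration only; the truncated abstract does not describe the authors' approach, and the pattern and the name `tokenize` are assumptions.

```python
import re

# Alternatives are tried left to right, so the more specific patterns come first.
TOKEN_RE = re.compile(
    r"""
    https?://\S+                              # URLs
  | [@#]\w+                                   # @mentions and #hashtags
  | \((?=[^\s)]*[_^=;~*])[^\s()]{1,6}\)       # short parenthesised emoticons, e.g. (=^-^=)
  | [:;=][-']?[()DPp]                         # western emoticons, e.g. :-) or ;P
  | \w+(?:[;'-]\w+)*                          # words, including "tl;dr" and "don't"
  | \S                                        # any other single non-space symbol
    """,
    re.VERBOSE,
)


def tokenize(message):
    """Split a microblog message into tokens with the heuristic pattern above."""
    return TOKEN_RE.findall(message)


def naive_tokenize(message):
    """Baseline: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", message)


tokens = tokenize("gr8 2day (=^-^=) tl;dr http://t.co/x #nlp @bob loool!")
```

The baseline `naive_tokenize` breaks "(=^-^=)" into seven separate symbols, whereas the heuristic pattern returns it as a single token. A fixed pattern like this will still misfire on inputs it was not written for, which is exactly the maintenance problem such text creates.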
We investigate the expression of opinions about human entities in user-generated content (UGC). A set of 2,800 online news comments (8,000 sentences) was manually annotated, following a rich annotation scheme designed for this purpose. We conclude that the challenge in performing opinion mining in this type of content is correctly identifying the positive …
In this paper we will present Corpógrafo, a mature web-based environment for working with corpora, for terminology extraction, and for ontology development. We will explain Corpógrafo’s workflow and describe the most important information extraction methods used, namely its term extraction, and definition / semantic relations identification procedures. We …
In this paper we propose a set of stylistic markers for automatically attributing authorship to micro-blogging messages. The proposed markers include highly personal and idiosyncratic editing options, such as ‘emoticons’, interjections, punctuation, abbreviations and other low-level features. We evaluate the ability of these features to help discriminate …
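As a rough illustration of what low-level stylistic markers could look like in practice, the sketch below counts a few of the marker types listed above for a single message. The word lists, regular expressions, and feature names are hypothetical; the abstract does not specify the actual feature set or the classifier used.

```python
import re

EMOTICON_RE = re.compile(r"[:;=][-']?[()DPp]")    # e.g. :-) ;P =D
REPEATED_PUNCT_RE = re.compile(r"[!?.]{2,}")      # e.g. !!! or ?!
# Hypothetical, deliberately tiny word lists for the example.
INTERJECTIONS = {"lol", "haha", "wow", "ugh", "hmm"}
ABBREVIATIONS = {"2day", "gr8", "u", "thx", "btw"}


def stylistic_features(message):
    """Return simple per-message counts of a few stylistic markers."""
    words = re.findall(r"\w+", message.lower())
    return {
        "emoticons": len(EMOTICON_RE.findall(message)),
        "repeated_punct": len(REPEATED_PUNCT_RE.findall(message)),
        "interjections": sum(word in INTERJECTIONS for word in words),
        "abbreviations": sum(word in ABBREVIATIONS for word in words),
    }


features = stylistic_features("wow!!! gr8 news :-) thx")
```

Feature dictionaries like this one could be aggregated per author and passed to any standard classifier; which classifier and which aggregation work well is an empirical question this sketch does not address.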