Learn More
Name disambiguation can occur when one is seeking a list of publications of an author who has used different name variations and when there are multiple other authors with the same name. We present an efficient integrative framework for solving the name disambiguation problem: a blocking method retrieves candidate classes of authors with similar names and a(More)
This paper is concerned with the class imbalance problem which has been known to hinder the learning performance of classification algorithms. The problem occurs when there are significantly less number of observations of the target concept. Various real-world classification tasks, such as medical diagnosis, text categorization and fraud detection suffer(More)
A formal type of scientific and academic collaboration is coauthorship which can be represented by a coauthorship network. Coauthorship networks are among some of the largest social networks and offer us the opportunity to study the mechanisms underlying large-scale real world networks. We construct such a network for the Computer Science field covering(More)
Name ambiguity is a special case of identity uncertainty where one person can be referenced by multiple name variations in different situations or even share the same name with other people. In this paper, we focus on the problem of disambiguating person names within web pages and scientific documents. We present an efficient and effective two-stage(More)
The K-nearest neighbor (KNN) decision rule has been a ubiquitous classification tool with good scalability. Past experience has shown that the optimal choice of K depends upon the data, making it laborious to tune the parameter for different applications. We introduce a new metric that measures the informativeness of objects to be classified. When applied(More)
We describe the SEERLAB system that participated in the SemEval 2010's Keyphrase Extraction Task. SEERLAB utilizes the DBLP corpus for generating a set of candidate keyphrases from a document. Random Forest, a supervised ensemble classifier, is then used to select the top keyphrases from the candidate set. SEERLAB achieved a 0.24 F-score in generating the(More)
It has been widely observed that search queries are composed in a very different style from that of the body or the title of a document. Many techniques explicitly accounting for this language style discrepancy have shown promising results for information retrieval, yet a large scale analysis on the extent of the language differences has been lacking. In(More)
We propose a novel multiclass classification algorithm Gentle Adaptive Multiclass Boosting Learning (GAMBLE). The algorithm naturally extends the two class Gentle AdaBoost algorithm to multiclass classification by using the multi-class exponential loss and the multiclass response encoding scheme. Unlike other multiclass algorithms which reduce the K-class(More)