Disambiguating Web appearances of people in a social network
This paper presents two unsupervised frameworks for disambiguating Web appearances of people: one based on the link structure of the Web pages, the other using Agglomerative/Conglomerative Double Clustering (A/CDC), an application of a recently introduced multi-way distributional clustering method.
Distributional Word Clusters vs. Words for Text Categorization
This work studies an approach to text categorization that combines distributional clustering of words with a Support Vector Machine (SVM) classifier. The resulting word-cluster representation significantly outperforms the word-based representation in terms of categorization accuracy or representation efficiency.
Automatic Categorization of Email into Folders: Benchmark Experiments on Enron and SRI Corpora
An extensive benchmark study of email foldering using two large corpora of real-world email messages and foldering schemes: one from former Enron employees, another from participants in an SRI research project.
Scaling up machine learning: parallel and distributed approaches
This tutorial gives a broad view of modern approaches for scaling up machine learning and data mining methods on parallel/distributed platforms and provides an integrated overview of state-of-the-art platforms and algorithm choices.
Extracting social networks and contact information from email and the Web
This work presents an end-to-end system that extracts a user's social network and its members' contact information from the user's email inbox, and discusses the system's capabilities for address book population, expert-finding, and social network analysis.
Using Bigrams in Text Categorization
In the past decade, considerable effort has been spent on finding a document representation that is richer than the simple Bag-Of-Words (BOW). One of the widely explored...
Multi-way distributional clustering via pairwise interactions
An extensive empirical study of two-way, three-way and four-way applications of the MDC scheme on six real-world datasets, including the 20 Newsgroups and the Enron email collection, shows that the algorithms consistently and significantly outperform previous state-of-the-art information-theoretic clustering algorithms.
On feature distributional clustering for text categorization
This work describes a text categorization approach that combines feature distributional clusters with a support vector machine (SVM) classifier, yielding high-performance text classification that can outperform other recent methods in terms of categorization accuracy and representation efficiency.