A Framework for Clustering Evolving Data Streams
This paper discusses a fundamentally different philosophy for data stream clustering, one guided by application-centered requirements that combines a pyramidal time frame with a microclustering approach.
Mining sequential patterns by pattern-growth: the PrefixSpan approach
  • J. Pei, Jiawei Han, +5 authors M. Hsu
  • Computer Science
  • IEEE Transactions on Knowledge and Data…
  • 1 November 2004
This paper proposes a projection-based, sequential pattern-growth approach for efficient mining of sequential patterns, and shows that PrefixSpan outperforms the Apriori-based GSP algorithm, FreeSpan, and SPADE, making it the fastest of the tested algorithms.
BIDE: efficient mining of frequent closed sequences
BIDE is an efficient algorithm for mining frequent closed sequences without candidate maintenance. It adopts a novel sequence closure checking scheme called bidirectional extension, and prunes the search space more deeply than previous algorithms by using the BackScan pruning method and the Scan-Skip optimization technique.
A Dirichlet multinomial mixture model-based approach for short text clustering
This paper proposes a collapsed Gibbs sampling algorithm for the Dirichlet Multinomial Mixture model (GSDMM) for short text clustering, and finds that GSDMM can infer the number of clusters automatically with a good balance between the completeness and homogeneity of the clustering results, and that it converges quickly.
Comparing Stars: On Approximating Graph Edit Distance
Three novel methods are introduced to compute upper and lower bounds for the edit distance between two graphs in polynomial time, and the results show that these methods scale well in terms of both the number of graphs and the size of the graphs.
CLOSET+: searching for the best strategies for mining frequent closed itemsets
This paper integrates the advantages of previously proposed effective strategies with some newly developed ones to produce a winning algorithm, CLOSET+.
EASE: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data
An extended inverted index is proposed to facilitate keyword-based search, together with a novel ranking mechanism for enhancing search effectiveness; the method achieves both high search efficiency and high accuracy.
A Framework for Projected Clustering of High Dimensional Data Streams
This paper proposes a new high-dimensional, projected data stream clustering method called HPStream, which incorporates a fading cluster structure and a projection-based clustering methodology, and achieves better clustering quality than previous stream clustering methods.
Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions
A thorough overview and analysis of the main approaches to entity linking is presented, and various applications, the evaluation of entity linking systems, and future directions are discussed.
Frequent pattern mining with uncertain data
This paper shows how broad classes of algorithms can be extended to the uncertain data setting, and studies candidate generate-and-test algorithms, hyper-structure algorithms, and pattern-growth-based algorithms.