This paper considers the problem of publishing "transaction data" for research purposes. Each transaction is an arbitrary set of items chosen from a large universe. Detailed transaction data provide an electronic image of one's life. This has two implications. One, transaction data are excellent candidates for data mining research. Two, use of transaction …
Indexing microblogs for real-time search is challenging given the efficiency issue caused by the tremendous speed at which new microblogs are created by users. Existing approaches address this efficiency issue at the cost of query accuracy, as they either (i) exclude a significant portion of microblogs from the index to reduce update cost or (ii) rank …
Mining frequent patterns, including frequent closed patterns and maximal patterns, is a fundamental problem in data mining. Many algorithms adopt the pattern growth approach, which has been shown to be superior to the candidate generate-and-test approach, especially when long patterns exist in the datasets. In this paper, we identify the …
Biosequences typically have a small alphabet, a long length, and patterns containing gaps (i.e., "don't care") of arbitrary size. Mining frequent patterns in such sequences faces a different type of explosion than in transaction sequences, which are motivated primarily by market-basket analysis. In this paper, we study how this explosion affects the classic sequential …
Mining frequent patterns is a fundamental problem in many data mining applications. Many algorithms adopt the pattern growth approach, which has been shown to be significantly superior to the candidate generate-and-test approach. In this paper, we identify the key factors that influence the performance of the pattern growth approach, and …
Group-based anonymization is the most widely studied approach for privacy-preserving data publishing. Privacy models and definitions that use group-based anonymization include k-anonymity, l-diversity, and t-closeness, to name a few. The goal of this article is to raise a fundamental issue regarding the privacy exposure of the approaches …
We consider the problem of publishing sensitive transaction data with privacy preservation. The high dimensionality of transaction data poses unique challenges for data privacy and data utility. On one hand, re-identification attacks tend to use a subset of items that occur infrequently in transactions, called moles. On the other hand, data mining applications …
In this paper, we propose a new framework for mining frequent patterns from large transactional databases. The core of the framework is a novel coded prefix-path tree with two representations, namely, a memory-based prefix-path tree and a disk-based prefix-path tree. The disk-based prefix-path tree is simple in its data structure yet rich in information …
While previous work on privacy-preserving serial data publishing considers the scenario where sensitive values may persist over multiple data releases, we find that no previous work provides sufficient protection for sensitive values that can change over time, which should be the more common case. In this work, we propose to study the privacy guarantee …