A unifying framework for greedy mining approximate top-k binary patterns and their evaluation

Abstract

A major mining task for binary matrices is the extraction of approximate top-k patterns that concisely describe the input data. The top-k pattern discovery problem is commonly stated as an optimization problem, where the goal is to minimize a given cost function, e.g. one measuring the accuracy of the data description. In this work, we review several greedy algorithms and discuss PaNDa, an enhanced version of a previously proposed algorithm, which greedily optimizes several cost functions generalized into a unifying formulation. In evaluating the set of mined patterns, we aim to complement the usual assessment methodology, which only measures the given cost function. Thus, we also evaluate how well the extracted models/patterns unveil supervised knowledge on the data, i.e. the class labels of the data instances. We tested state-of-the-art algorithms and diverse cost functions on several datasets from the UCI repository. As expected, internal (cost function) and external (classification accuracy) indices of quality provide contrasting results. Nevertheless, PaNDa performs best, since the classifiers built over the mined patterns, used as record features, are the most accurate in the majority of cases.
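
To make the two kinds of quality index concrete, the following is a minimal sketch, not the formulation used in the paper. It assumes a pattern is a pair (set of rows, set of columns), and that the internal cost is the total pattern description size plus the number of cells where the patterns and the data disagree. The names `cost` and `pattern_features` are hypothetical and chosen for illustration only.

```python
def covered_cells(patterns):
    """Return the union of all (row, col) cells described by the patterns."""
    cells = set()
    for rows, cols in patterns:
        cells.update((r, c) for r in rows for c in cols)
    return cells


def cost(data, patterns):
    """Assumed internal index: description size plus mismatching cells."""
    ones = {(r, c)
            for r, row in enumerate(data)
            for c, value in enumerate(row) if value}
    # Noise: cells that are 1 in the data but not covered, or covered but 0.
    noise = len(ones ^ covered_cells(patterns))
    description = sum(len(rows) + len(cols) for rows, cols in patterns)
    return description + noise


def pattern_features(n_rows, patterns):
    """One binary feature per pattern: does the row belong to the pattern?"""
    return [[1 if r in rows else 0 for rows, _ in patterns]
            for r in range(n_rows)]


# Toy binary matrix and two hand-picked patterns (illustrative data only).
data = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 1],
]
patterns = [({0, 1, 2}, {0, 1}), ({2, 3}, {2, 3})]

print(cost(data, patterns))
print(pattern_features(len(data), patterns))
```

Under these assumptions, the feature matrix returned by `pattern_features` is what a classifier would be trained on for the external evaluation, with the class labels of the rows as targets.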

Cite this paper

@inproceedings{Lucchese2012AUF, title={A unifying framework for greedy mining approximate top-k binary patterns and their evaluation}, author={Claudio Lucchese and Salvatore Orlando and Raffaele Perego}, year={2012} }