Condensed representations for data mining

Abstract

INTRODUCTION

Condensed representations were proposed in (Mannila & Toivonen, 1996) as a useful concept for optimizing typical data mining tasks. It appears as a key concept (… Raedt, 2002), and this paper introduces this research domain, its achievements in the context of frequent itemset mining (FIM) from transactional data, and its future trends.

Within the inductive database framework, knowledge discovery processes are considered as querying processes. Inductive databases (IDBs) contain not only data but also patterns. In an IDB, ordinary queries can be used to access and manipulate data, while inductive queries can be used to generate (mine), manipulate, and apply patterns.

To motivate the need for condensed representations, let us start from the simple model proposed in (Mannila & Toivonen, 1997). Many data mining tasks can be abstracted into the computation of a theory. Given a language L of patterns (e.g., itemsets), a database instance r (e.g., a transactional database), and a selection predicate q that specifies whether a given pattern is interesting or not (e.g., the itemset is frequent in r), a data mining task can be formalized as the computation of Th(L,q,r) = {φ ∈ L | q(φ,r) is true}. This can also be regarded as the evaluation of the inductive query q. Notice that the definition requires every pattern that satisfies q to be computed. This completeness assumption is quite common for local pattern discovery tasks but is generally not acceptable for more complex tasks (e.g., accuracy optimization for predictive model mining).

The selection predicate q can be defined as a Boolean expression over primitive constraints (e.g., a minimal frequency constraint used in conjunction with a syntactic constraint that enforces the presence or absence of some sub-patterns). Some of the primitive constraints refer to the "behavior" of a pattern in the data through so-called evaluation functions (e.g., frequency).
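As an illustration of the definition of Th(L,q,r), the following minimal Python sketch evaluates a theory by brute force on a toy transactional database. The function names (`frequency`, `all_itemsets`, `theory`), the toy data, and the thresholds are assumptions made for this example and do not come from the paper. The predicate `q` combines a minimal frequency constraint with a syntactic constraint, with frequency taken here as an absolute count of supporting transactions.

```python
from itertools import combinations


def frequency(itemset, transactions):
    """Evaluation function: number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def all_itemsets(items):
    """The pattern language L: every non-empty subset of `items`."""
    ordered = sorted(items)
    for size in range(1, len(ordered) + 1):
        for combo in combinations(ordered, size):
            yield frozenset(combo)


def theory(language, q, r):
    """Th(L, q, r): every pattern phi of L for which q(phi, r) is true."""
    return [phi for phi in language if q(phi, r)]


# Hypothetical toy database r: four transactions over the items a, b, c.
r = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("bc")]
items = set().union(*r)


def q(phi, r, min_freq=2, required="a"):
    """Selection predicate: minimal frequency AND a syntactic constraint (phi contains `required`)."""
    return frequency(phi, r) >= min_freq and required in phi


result = theory(all_itemsets(items), q, r)
print(sorted("".join(sorted(phi)) for phi in result))
```

The sketch enumerates all 2^n - 1 non-empty itemsets over n items before testing the predicate, so it is only usable on very small item sets.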
To support the whole knowledge discovery process, it is important to support the computation of many different but correlated theories. It is well known that a "generate and test" approach, which would enumerate the sentences of L and then test the selection predicate q on each of them, is generally infeasible. Data mining researchers have made a huge effort to make active use of the primitive constraints occurring in q in order to achieve a tractable evaluation of useful mining queries. It is the domain of constraint-based …

Cite this paper

@inproceedings{Boulicaut2004CondensedRF, title={Condensed representations for data mining}, author={Jean-François Boulicaut}, year={2004} }