Substructure Discovery Using Minimum Description Length and Background Knowledge

Diane Joyce Cook and Lawrence B. Holder
The ability to identify interesting and repetitive substructures is an essential component to discovering knowledge in structural data. We describe a new version of our SUBDUE substructure discovery system based on the minimum description length principle. The SUBDUE system discovers substructures that compress the original data and represent structural concepts in the data. By replacing previously-discovered substructures in the data, multiple passes of SUBDUE produce a hierarchical… 
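The compression idea in the abstract can be illustrated with a small sketch. This is not the SUBDUE encoding: the paper's minimum description length measure counts bits, whereas the sketch below uses vertex count plus edge count as a crude stand-in for description length. The function names (`size`, `compress`, `compression_value`), the `SUB` label, and the toy graph are all hypothetical and chosen only for illustration.

```python
def size(vertices, edges):
    """Crude stand-in for description length: vertex count plus edge count."""
    return len(vertices) + len(edges)


def compress(vertices, edges, instances, sub_label="SUB"):
    """Replace each non-overlapping instance (a set of vertex ids) with one vertex.

    vertices: dict mapping vertex id -> label
    edges: list of (src, dst, label) tuples
    """
    # Map every vertex that belongs to an instance to its replacement vertex.
    owner = {}
    for i, inst in enumerate(instances):
        for v in inst:
            owner[v] = f"{sub_label}_{i}"

    new_vertices = {v: lbl for v, lbl in vertices.items() if v not in owner}
    for i in range(len(instances)):
        new_vertices[f"{sub_label}_{i}"] = sub_label

    new_edges = []
    for src, dst, lbl in edges:
        s, d = owner.get(src, src), owner.get(dst, dst)
        if src in owner and s == d:
            continue  # edge lies inside a single instance, so it is absorbed
        new_edges.append((s, d, lbl))
    return new_vertices, new_edges


def compression_value(vertices, edges, sub_vertices, sub_edges, instances):
    """Ratio of original size to (substructure size + compressed graph size).

    Values above 1 mean the substructure compresses the graph under this
    simplified measure.
    """
    cv, ce = compress(vertices, edges, instances)
    return size(vertices, edges) / (size(sub_vertices, sub_edges) + size(cv, ce))


# Toy graph: two triangles joined by one "link" edge.
vertices = {1: "a", 2: "b", 3: "c", 4: "a", 5: "b", 6: "c"}
edges = [
    (1, 2, "e"), (2, 3, "e"), (3, 1, "e"),
    (4, 5, "e"), (5, 6, "e"), (6, 4, "e"),
    (3, 4, "link"),
]

# Candidate substructure: the triangle, with its two instances in the graph.
sub_vertices = {"x": "a", "y": "b", "z": "c"}
sub_edges = [("x", "y", "e"), ("y", "z", "e"), ("z", "x", "e")]
instances = [{1, 2, 3}, {4, 5, 6}]

cv, ce = compress(vertices, edges, instances)
value = compression_value(vertices, edges, sub_vertices, sub_edges, instances)
print(f"original size={size(vertices, edges)}, compressed size={size(cv, ce)}, value={value:.2f}")
```

Running `compress` again on the output, with a substructure that includes `SUB` vertices, would mimic the multiple passes the abstract mentions, in which previously discovered substructures become building blocks for larger ones.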


Substructure Discovery in the SUBDUE System
The SUBDUE system, which uses the minimum description length (MDL) principle to discover substructures that compress the database and represent structural concepts in the data, is described.
An Empirical Study of Domain Knowledge and Its Benefits to Substructure Discovery
Results show that domain specific knowledge improves the search for substructures that are useful to the domain and leads to greater compression of the data.
Subdue: compression-based frequent pattern discovery in graph data
The graph-based data mining system Subdue is described. It uses a heuristic algorithm to discover subgraphs that are not only frequent but also compress the graph dataset.
Finding the most descriptive substructures in graphs with discrete and numeric labels
This paper explores the relationship between graph structure and the distribution of attribute values and proposes an outlier-detection step, which is used as a constraint during substructure discovery and applies to multi-dimensional numeric attributes.
Approaches to Parallel Graph-Based Knowledge Discovery
This research investigates approaches for scaling a particular knowledge discovery/data mining system, Subdue, using parallel and distributed resources, and potential achievements and obstacles are discussed.
Structural Pattern Recognition in Graphs
This chapter describes an approach to discovering patterns in relational data represented as a graph based on the minimum description length (MDL) principle, which measures how well various patterns compress the original database.
Discovering Substructures in the Chemical Toxicity Domain
The researcher’s ability to interpret the data and discover interesting patterns within the data is of great importance as it helps in obtaining relevant SARs and identifying conceptually interesting substructures that enhance the interpretation of data.
Structure Discovery from Sequential Data
I-Subdue, an extension to the Subdue graph-based data mining system, is described. It operates over sequentially received relational data to incrementally discover the most representative substructures, addressing the challenge of locally optimal substructures overshadowing those that are globally optimal.
Coupling Two Complementary Knowledge Discovery Systems
This work investigates a simpler integration of the two systems: the structural discovery system is first executed on the data, and its results are then used to augment or compress the data before it is input to the attribute-value-based system.
Exploiting parallelism in knowledge discovery systems to improve scalability
  • G. Galal, D. Cook, L. Holder
  • Computer Science
    Proceedings of the Thirty-First Hawaii International Conference on System Sciences
  • 1998
This research outlines a general approach for scaling KDD systems using parallel and distributed resources and applies the suggested strategies to the SUBDUE knowledge discovery system.


Discovery of Inexact Concepts from Structural Data
An implementation of the authors' SUBDUE system that employs an inexact graph match to discover substructures which occur often in the data, but not always in the same form, is described.
A Minimal Encoding Approach to Feature Discovery
This paper discusses unsupervised learning of orthogonal concepts on relational data, which demands a much larger search space than traditional concept learning algorithms, and requires that the concepts be interpretable by a human, an ability not yet realized with connectionist algorithms.
Graph Clustering and Model Learning by Data Compression
Unifying Learning Methods by Colored Digraphs
A graph-based induction algorithm that extracts typical patterns from colored digraphs that enables the uniform treatment of the above two learning tasks to solve complex learning problems such as the construction of hierarchical knowledge bases.
Learning Engineering Models with the Minimum Description Length Principle
The minimum description length principle, together with the KEDS algorithm, is used to guide the partitioning of the problem space and has been tested on discovering models for predicting the performance efficiencies of an internal combustion engine.
Grammatical Inference Based on Hyperedge Replacement
The main result is a characterization of the inferred grammars as “samples-composing” meaning that each sample can be derived and each rule contributes to the generation of samples in a certain way.
A Self-Organizing Retrieval System for Graphs
The design of a general knowledge base for labeled graphs is presented. The design involves a partial ordering of graphs represented as subsets of nodes of a universal graph. The knowledge base's