Scalable Discovery of Informative Structural Concepts Using Domain Knowledge

  • Authors:
  • Diane J. Cook; Lawrence B. Holder; Surnjani Djoko

  • Venue:
  • IEEE Expert: Intelligent Systems and Their Applications
  • Year:
  • 1996

Abstract

The Subdue system evaluates the benefits of using domain knowledge to guide the discovery of repetitive, functional substructures in large structural databases. Results show that domain-specific knowledge improves the search for such substructures and enables greater data compression.

The increasing amount and complexity of today's data creates an urgent need to accelerate discovery of knowledge in large databases. In response, designers have developed numerous approaches for discovering concepts in databases using a linear, attribute-value representation. These approaches address issues of data relevance, missing data, noise, and domain knowledge. However, much of the data collected is structural in nature or composed of parts and relations between the parts. Hence, there is a need for scalable tools to analyze and discover concepts in structural databases. Many reported discovery tools are also computationally expensive and cannot scale easily to large databases, especially those containing structural information.

Recently, we introduced a method for discovering substructures in structural databases using the minimum description length (MDL) principle. The system, called Subdue, discovers substructures that compress the input database and represent structural concepts. Once Subdue discovers a substructure, the system simplifies the data by replacing instances of the substructure with a pointer to the substructure definition. The discovered substructures allow abstraction over detailed structures in the original data. Iteration of the substructure discovery and replacement process constructs a hierarchical description of the structural data in terms of the discovered substructures. This hierarchy provides varying levels of interpretation that users can access based on the specific goals of the data analysis.

In this article, we focus on how to realize the benefits of domain-dependent discovery approaches by adding domain-specific knowledge to a domain-independent discovery system. We also evaluate the benefits and costs of using domain-specific information. In particular, we measure the performance of the Subdue system with and without domain-specific knowledge along the performance dimensions of compression, the time needed to discover the substructures, and the usefulness of the discovered substructures. Finally, we address the issue of scalability of structure discovery using Subdue. On the basis of scalability tests we've conducted, we highlight features of databases that can affect Subdue's performance.
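
The discover-and-replace loop described in the abstract can be illustrated with a small Python sketch. This is not the Subdue implementation and makes several simplifying assumptions: candidate substructures are limited to single labeled edges, description length is approximated as the number of vertices plus edges rather than a true MDL bit encoding, and the optional `weights` argument is a hypothetical stand-in for domain knowledge that scales a candidate's score. All function names are illustrative.

```python
def description_length(vertices, edges):
    # Assumption: crude proxy for an MDL encoding (vertex count + edge count).
    return len(vertices) + len(edges)

def find_instances(vertices, edges, pattern):
    """Return non-overlapping edges whose labels match (src, edge, dst)."""
    used, instances = set(), []
    for u, v, label in edges:
        if u == v or u in used or v in used:
            continue
        if (vertices[u], label, vertices[v]) == pattern:
            instances.append((u, v, label))
            used.update((u, v))
    return instances

def compress(vertices, edges, pattern, name):
    """Replace each instance of the pattern with one vertex labeled `name`."""
    instances = find_instances(vertices, edges, pattern)
    merged = {}                      # removed vertex id -> surviving vertex id
    new_vertices = dict(vertices)
    for u, v, _ in instances:
        merged[v] = u
        new_vertices[u] = name       # acts as the pointer to the definition
        del new_vertices[v]
    instance_set = set(instances)
    new_edges = [(merged.get(u, u), merged.get(v, v), label)
                 for u, v, label in edges if (u, v, label) not in instance_set]
    return new_vertices, new_edges

def best_pattern(vertices, edges, weights=None):
    """Pick the pattern with the highest compression value above 1, if any."""
    weights = weights or {}          # hypothetical domain-knowledge multipliers
    base = description_length(vertices, edges)
    candidates = dict.fromkeys((vertices[u], label, vertices[v])
                               for u, v, label in edges)
    best, best_value = None, 1.0
    for pattern in candidates:
        if not find_instances(vertices, edges, pattern):
            continue
        cv, ce = compress(vertices, edges, pattern, "SUB")
        # The definition itself costs 3 (two vertices and one edge).
        value = base / (3 + description_length(cv, ce))
        value *= weights.get(pattern, 1.0)
        if value > best_value:
            best, best_value = pattern, value
    return best

def discover_hierarchy(vertices, edges, weights=None):
    """Repeat discovery and replacement until nothing compresses further."""
    hierarchy = []
    while True:
        pattern = best_pattern(vertices, edges, weights)
        if pattern is None:
            break
        name = f"SUB_{len(hierarchy) + 1}"
        vertices, edges = compress(vertices, edges, pattern, name)
        hierarchy.append((name, pattern))
    return hierarchy, vertices, edges

# Toy database: three "A on B" pairs chained by "next" edges, ending in "C".
vertices = {0: "A", 1: "B", 2: "A", 3: "B", 4: "A", 5: "B", 6: "C"}
edges = [(0, 1, "on"), (2, 3, "on"), (4, 5, "on"),
         (1, 2, "next"), (3, 4, "next"), (5, 6, "next")]
hierarchy, final_vertices, final_edges = discover_hierarchy(vertices, edges)
```

On this toy input the repeated "A on B" edge is the only candidate that shrinks the graph, so it becomes the single entry in the hierarchy. Passing a `weights` dictionary keyed by pattern shows, in a very simplified way, how outside knowledge could bias which substructure is preferred.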