Scalable Discovery of Informative Structural Concepts Using Domain Knowledge

Authors:
Diane J. Cook; Lawrence B. Holder; Surnjani Djoko
Venue:
IEEE Expert: Intelligent Systems and Their Applications
Year:
1996


Abstract

The Subdue system evaluates the benefits of using domain knowledge to guide the discovery of repetitive, functional substructures in large structural databases. Results show that domain-specific knowledge improves the search for such substructures and enables greater data compression.

The increasing amount and complexity of today's data creates an urgent need to accelerate discovery of knowledge in large databases. In response, designers have developed numerous approaches for discovering concepts in databases using a linear, attribute-value representation. These approaches address issues of data relevance, missing data, noise, and domain knowledge. However, much of the data collected is structural in nature or composed of parts and relations between the parts. Hence, there is a need for scalable tools to analyze and discover concepts in structural databases. Many reported discovery tools are also computationally expensive and cannot scale easily to large databases, especially those containing structural information.

Recently, we introduced a method for discovering substructures in structural databases using the minimum description length (MDL) principle. The system, called Subdue, discovers substructures that compress the input database and represent structural concepts. Once Subdue discovers a substructure, the system simplifies the data by replacing instances of the substructure with a pointer to the substructure definition. The discovered substructures allow abstraction over detailed structures in the original data. Iteration of the substructure discovery and replacement process constructs a hierarchical description of the structural data in terms of the discovered substructures. This hierarchy provides varying levels of interpretation that users can access based on the specific goals of the data analysis.

In this article, we focus on how to realize the benefits of domain-dependent discovery approaches by adding domain-specific knowledge to a domain-independent discovery system. We also evaluate the benefits and costs of using domain-specific information. In particular, we measure the performance of the Subdue system with and without domain-specific knowledge along the performance dimensions of compression, the time needed to discover the substructures, and the usefulness of the discovered substructures. Finally, we address the issue of scalability of structure discovery using Subdue. On the basis of scalability tests we've conducted, we highlight features of databases that can affect Subdue's performance.
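
The discover-and-replace loop described in the abstract can be illustrated with a small sketch. This is not Subdue's implementation. It is a toy under several loudly stated assumptions: the graph is a dict of vertex labels plus a list of directed labeled edges, candidate substructures are restricted to single labeled edges, matching is exact, no domain knowledge is used, and "description length" is approximated by a crude count of vertices plus edges instead of a bit-level encoding. All function names (`graph_size`, `find_instances`, `compress`, `best_pattern`, `discover_hierarchy`, `toy_graph`) are hypothetical.

```python
# Toy sketch of compression-driven substructure discovery (NOT Subdue itself).
# Assumptions: patterns are single labeled edges, matching is exact, and
# "size" (vertices + edges) stands in for a real description length.

PATTERN_SIZE = 3  # a single-edge pattern is stored once: 2 vertices + 1 edge


def graph_size(vertices, edges):
    """Crude stand-in for description length."""
    return len(vertices) + len(edges)


def find_instances(vertices, edges, pattern):
    """Greedily pick non-overlapping edges matching (label_u, edge_label, label_v)."""
    used, instances = set(), []
    for u, v, lab in sorted(edges):
        matches = (vertices[u], lab, vertices[v]) == pattern
        if matches and u != v and u not in used and v not in used:
            instances.append((u, v, lab))
            used.update((u, v))
    return instances


def compress(vertices, edges, pattern, name):
    """Replace each instance of the pattern with one pointer vertex labeled `name`."""
    instances = find_instances(vertices, edges, pattern)
    new_vertices = dict(vertices)
    merged = {}  # old vertex id -> pointer vertex id
    for i, (u, v, _) in enumerate(instances):
        pointer = f"{name}#{i}"
        del new_vertices[u], new_vertices[v]
        new_vertices[pointer] = name
        merged[u] = merged[v] = pointer
    instance_edges = set(instances)
    # Drop the edges absorbed into instances; reattach the rest to the pointers.
    new_edges = [
        (merged.get(u, u), merged.get(v, v), lab)
        for u, v, lab in edges
        if (u, v, lab) not in instance_edges
    ]
    return new_vertices, new_edges


def best_pattern(vertices, edges):
    """Return the pattern with the highest compression value, or None if none is above 1."""
    total = graph_size(vertices, edges)
    best, best_value = None, 1.0
    candidates = sorted({(vertices[u], lab, vertices[v]) for u, v, lab in edges})
    for pattern in candidates:
        cv, ce = compress(vertices, edges, pattern, "tmp")
        value = total / (PATTERN_SIZE + graph_size(cv, ce))
        if value > best_value:
            best, best_value = pattern, value
    return best, best_value


def discover_hierarchy(vertices, edges, max_levels=10):
    """Repeat discovery and replacement; later patterns may refer to earlier ones."""
    levels = []
    for level in range(1, max_levels + 1):
        pattern, value = best_pattern(vertices, edges)
        if pattern is None:
            break
        name = f"SUB_{level}"
        vertices, edges = compress(vertices, edges, pattern, name)
        levels.append((name, pattern, value))
    return levels, (vertices, edges)


def toy_graph():
    """Three C-O-H chains whose C vertices are linked together (made-up data)."""
    vertices, edges = {}, []
    for i in (1, 2, 3):
        vertices.update({f"c{i}": "C", f"o{i}": "O", f"h{i}": "H"})
        edges += [(f"c{i}", f"o{i}", "bond"), (f"o{i}", f"h{i}", "bond")]
    edges += [("c1", "c2", "link"), ("c2", "c3", "link")]
    return vertices, edges


if __name__ == "__main__":
    found, (final_vertices, final_edges) = discover_hierarchy(*toy_graph())
    for name, pattern, value in found:
        print(name, pattern, round(value, 3))
    print("final size:", graph_size(final_vertices, final_edges))
```

In this sketch, a pattern found at a later level can contain a pointer label introduced at an earlier level, which is how the toy hierarchy arises. A domain-knowledge variant could, for example, scale `value` by a user-supplied preference for certain patterns; that hook is left out here because the abstract does not specify how the knowledge is encoded.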