Reducing the size of databases for multirelational classification: a subgraph-based approach

Authors:
Hongyu Guo;Herna L. Viktor;Eric Paquet
Affiliations:
National Research Council of Canada, Ottawa, Canada K1A 0R6;School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Canada K1N 6N5;National Research Council of Canada, Ottawa, Canada K1A 0R6 and School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Canada K1N 6N5
Venue:
Journal of Intelligent Information Systems
Year:
2013

Abstract

Multirelational classification aims to discover patterns across multiple interlinked tables (relations) in a relational database. In many large organizations, such a database spans numerous departments and subdivisions, which are involved in different aspects of the enterprise such as customer profiling, fraud detection, inventory management, and financial management. In classification, economic utility affects several phases of the knowledge discovery process. For instance, during data preprocessing, one must consider the cost of acquiring, cleaning, and transforming large volumes of data. When training and testing the data mining models, one must consider the impact of the data size on the running time of the learning algorithm. To address these utility-based issues, this paper presents an approach that creates a pruned database for multirelational classification while minimizing the loss of predictive performance in the final model. Our method identifies a set of strongly uncorrelated subgraphs from the original database schema to use for training, and discards all others. Our experiments show that this strategy significantly reduces the size of the databases, in terms of the number of relations, tuples, and attributes, without sacrificing predictive accuracy. The approach prunes the sizes of databases by as much as 94%. This reduction also lowers the computational cost of the learning process: the method improves the execution time of the multirelational learning algorithms by as much as 80%. In particular, our results demonstrate that one may build an accurate model with only a small subset of the provided database.
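
The selection idea in the abstract (keep subgraphs that are weakly correlated with one another, discard the rest) can be illustrated with a minimal Python sketch. This is not the paper's algorithm, whose details the abstract does not give. It assumes each schema subgraph has already been summarized as one numeric score per training tuple, and it uses a simple greedy filter on Pearson correlation. The table names, scores, the `max_inter` threshold, and the function names are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant vector carries no correlation signal
    return cov / sqrt(vx * vy)


def select_subgraphs(scores, labels, max_inter=0.5):
    """Greedy filter (an assumption, not the authors' procedure).

    Rank subgraphs by |correlation with the class labels|, then keep a
    subgraph only if its |correlation| with every subgraph kept so far
    is at most `max_inter`.
    """
    ranked = sorted(
        scores,
        key=lambda name: abs(pearson(scores[name], labels)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        if all(abs(pearson(scores[name], scores[k])) <= max_inter for k in kept):
            kept.append(name)
    return kept


def pruned_relations(subgraphs, kept):
    """Relations that survive pruning: the union over the kept subgraphs."""
    return sorted({rel for name in kept for rel in subgraphs[name]})


# Hypothetical schema: Loan is the target relation.
subgraphs = {
    "loan-account": ["Loan", "Account"],
    "loan-account-order": ["Loan", "Account", "Order"],
    "loan-client": ["Loan", "Client"],
}
# Hypothetical per-tuple scores for each subgraph, plus class labels.
labels = [1, 1, 0, 0, 1, 0]
scores = {
    "loan-account": [0.8, 0.9, 0.1, 0.2, 0.8, 0.2],
    "loan-account-order": [0.9, 0.8, 0.2, 0.1, 0.7, 0.3],
    "loan-client": [0.6, 0.4, 0.6, 0.4, 0.7, 0.3],
}

kept = select_subgraphs(scores, labels)
print(kept)
print(pruned_relations(subgraphs, kept))
```

In this toy data, the `loan-account-order` scores closely track the `loan-account` scores, so the filter treats that subgraph as redundant and the hypothetical `Order` relation drops out of the pruned database.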