Reducing the size of databases for multirelational classification: a subgraph-based approach

  • Authors:
  • Hongyu Guo;Herna L. Viktor;Eric Paquet

  • Affiliations:
  • National Research Council of Canada, Ottawa, Canada K1A 0R6;School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Canada K1N 6N5;National Research Council of Canada, Ottawa, Canada K1A 0R6 and School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Canada K1N 6N5

  • Venue:
  • Journal of Intelligent Information Systems
  • Year:
  • 2013

Abstract

Multirelational classification aims to discover patterns across multiple interlinked tables (relations) in a relational database. In many large organizations, such a database spans numerous departments and subdivisions that are involved in different aspects of the enterprise, such as customer profiling, fraud detection, inventory management, and financial management. In classification, economic utility affects different phases of the knowledge discovery process. For instance, during data preprocessing, one must consider the cost of acquiring, cleaning, and transforming large volumes of data. When training and testing the data mining models, one has to consider the impact of the data size on the running time of the learning algorithm. To address these utility-based issues, the paper presents an approach to create a pruned database for multirelational classification while minimizing the loss in predictive performance of the final model. Our method identifies a set of strongly uncorrelated subgraphs from the original database schema to use for training, and discards all others. The experiments show that our strategy significantly reduces the size of the databases, in terms of the number of relations, tuples, and attributes, without sacrificing predictive accuracy. The approach reduces database size by as much as 94%. This reduction also lowers the computational cost of the learning process: the method improves the execution time of multirelational learning algorithms by as much as 80%. In particular, our results demonstrate that one may build an accurate model with only a small subset of the provided database.
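The abstract does not say how correlation between subgraphs is measured or how the final set is chosen, so the following is only a minimal sketch of the general idea of keeping mutually uncorrelated subgraphs. It assumes each subgraph already has a relevance score with respect to the class label and that pairwise correlations are available. The function `select_uncorrelated`, the threshold `max_corr`, the greedy strategy, and the subgraph names and numbers are all hypothetical and are not taken from the paper.

```python
def select_uncorrelated(relevance, pairwise_corr, max_corr=0.3):
    """Greedily keep subgraphs that are weakly correlated with those already kept.

    relevance: dict mapping subgraph name -> relevance to the class label
               (hypothetical score; the paper's measure is not given here).
    pairwise_corr: dict mapping frozenset({a, b}) -> absolute correlation
                   between subgraphs a and b (hypothetical values).
    max_corr: assumed cutoff above which two subgraphs count as redundant.
    """
    kept = []
    # Visit subgraphs from most to least class-relevant.
    for name in sorted(relevance, key=relevance.get, reverse=True):
        # Keep a subgraph only if it is weakly correlated with every kept one.
        if all(pairwise_corr[frozenset((name, k))] <= max_corr for k in kept):
            kept.append(name)
    return kept


# Illustrative schema paths and scores, invented for this example.
relevance = {
    "client-loan": 0.9,
    "client-account-loan": 0.8,
    "client-card": 0.5,
}
pairwise_corr = {
    frozenset(("client-loan", "client-account-loan")): 0.85,
    frozenset(("client-loan", "client-card")): 0.10,
    frozenset(("client-account-loan", "client-card")): 0.20,
}

selected = select_uncorrelated(relevance, pairwise_corr)
print(selected)
```

In this toy input, `client-account-loan` is dropped because it is highly correlated with the already selected `client-loan`. Relations that appear in no selected subgraph would then be candidates for removal from the training database.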