A Parallel Scalable Infrastructure for OLAP and Data Mining

  • Authors:
  • Sanjay Goil; Alok Choudhary

  • Affiliations:
  • -;-

  • Venue:
  • IDEAS '99 Proceedings of the 1999 International Symposium on Database Engineering & Applications
  • Year:
  • 1999

Abstract

Decision support systems are important in leveraging the information held in data warehouses in businesses such as banking, insurance, retail, and health care, among many others. The multi-dimensional aspects of a business can be expressed naturally using a multi-dimensional data model. Data analysis and data mining on these warehouses pose new challenges for traditional database systems. OLAP and data mining operations require summary information on these multi-dimensional data sets. Query processing for these applications requires different views of the data for analysis and effective decision making. Data mining techniques can be applied in conjunction with OLAP for an integrated business solution. As data warehouses grow, parallel processing techniques have been applied to enable the use of larger data sets and reduce the time for analysis, thereby enabling evaluation of many more options for decision making.

In this paper we address (1) scalability in multi-dimensional systems for OLAP and multi-dimensional analysis, (2) integration of data mining with the OLAP framework, and (3) high performance by using parallel processing for OLAP and data mining. We describe our system PARSIMONY - Parallel and Scalable Infrastructure for Multidimensional Online analytical processing. This platform is used for both OLAP and data mining. Sparsity of data sets is handled with sparse chunks stored in a bit-encoded sparse structure for compression, which enables aggregate operations on compressed data. Techniques for effectively using the summary information available in data cubes are presented for mining association rules and for decision-tree-based classification. These take advantage of the data organization provided by the multidimensional data model.

Performance results for high-dimensional data sets on a distributed-memory parallel machine (IBM SP-2) show good speedup and scalability.
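The abstract does not spell out the layout of the bit-encoded sparse structure, so the following is only a minimal sketch of the general idea, not PARSIMONY's actual implementation. It assumes that each non-empty cell of a chunk is stored as a single integer key in which every dimension's index occupies its own bit field, and that summing out a dimension can then be done by masking that field in the key without expanding the chunk to a dense array. The class name `SparseChunk` and its methods are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def bits_needed(size):
    """Bits required to store any index in the range [0, size)."""
    return max(1, (size - 1).bit_length())


class SparseChunk:
    """Hypothetical sparse chunk: stores only non-empty cells, keyed by packed coordinates."""

    def __init__(self, dim_sizes):
        self.widths = [bits_needed(s) for s in dim_sizes]
        # shifts[i] is the bit position where dimension i's field starts.
        self.shifts = []
        pos = 0
        for width in self.widths:
            self.shifts.append(pos)
            pos += width
        self.cells = {}  # packed key -> measure value

    def encode(self, coords):
        """Pack one index per dimension into a single integer key."""
        key = 0
        for index, shift in zip(coords, self.shifts):
            key |= index << shift
        return key

    def decode(self, key):
        """Unpack an integer key back into a coordinate tuple."""
        return tuple(
            (key >> shift) & ((1 << width) - 1)
            for shift, width in zip(self.shifts, self.widths)
        )

    def add(self, coords, value):
        """Accumulate a measure value into the cell at coords."""
        key = self.encode(coords)
        self.cells[key] = self.cells.get(key, 0) + value

    def aggregate_out(self, dim):
        """Sum over dimension `dim` by clearing its bit field in every key."""
        mask = ~(((1 << self.widths[dim]) - 1) << self.shifts[dim])
        totals = defaultdict(int)
        for key, value in self.cells.items():
            totals[key & mask] += value
        return dict(totals)


# A 4 x 3 x 5 chunk with three non-empty cells.
chunk = SparseChunk([4, 3, 5])
chunk.add((0, 1, 2), 10)
chunk.add((3, 1, 2), 5)
chunk.add((3, 2, 4), 7)

# Sum out dimension 0; the first two cells collapse into one.
totals = chunk.aggregate_out(0)
for key, value in sorted(totals.items()):
    print(chunk.decode(key), value)
```

In this sketch the aggregation touches only the stored cells, so its cost grows with the number of non-empty cells rather than with the full size of the chunk. In the result, the aggregated dimension's field is zero in every key.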