An integrated approach for scaling up classification and prediction algorithms for data mining

  • Authors:
  • Patricia E. N. Lutu

  • Affiliations:
  • University of Pretoria

  • Venue:
  • SAICSIT '02 Proceedings of the 2002 annual research conference of the South African institute of computer scientists and information technologists on Enablement through technology
  • Year:
  • 2002

Abstract

Classification and prediction algorithms for machine learning typically require all training data to be resident in memory during decision tree construction. Typically, a flat file is created from database or data warehouse data and loaded into memory for processing. This severely limits the scalability of these algorithms to practical data mining tasks. Some attempts have been made by researchers to implement disk-based algorithms which can handle much larger training sets. Both approaches suffer from three serious limitations. The first limitation is that a significant amount of the original data must be duplicated on disk. The second limitation is that these algorithms are unable to utilize the computational capabilities of the data warehouse or database system. Computing sums, counts and averages is among the operations that database and data warehouse systems can do very efficiently. The third limitation is that these algorithms produce very inflexible decision trees which cannot be manipulated by the analyst. This is because they fail to take advantage of those features of data cube technology that enable analysts to view data at different levels of abstraction. This paper proposes a data mining approach that removes the need to copy data from a data warehouse or database. The approach also facilitates On-line Analytical Mining (OLAM) as it integrates database and data warehouse queries with decision tree construction.
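
To illustrate the general idea of letting the database compute the counts that decision tree construction needs, the following is a minimal sketch and not the paper's actual algorithm. It assumes a hypothetical `weather` table held in an in-memory SQLite database, and the helper names `class_counts`, `entropy` and `information_gain` are invented for this example. A `GROUP BY` query returns per-value class counts, and only those aggregates, not the training rows, are brought into memory to score a candidate split.

```python
import math
import sqlite3


def class_counts(conn, table, attribute, class_column):
    """Let the database count rows per (attribute value, class label)."""
    # Identifiers are interpolated directly, so this sketch assumes trusted names.
    query = (
        f"SELECT {attribute}, {class_column}, COUNT(*) "
        f"FROM {table} GROUP BY {attribute}, {class_column}"
    )
    counts = {}
    for value, label, n in conn.execute(query):
        counts.setdefault(value, {})[label] = n
    return counts


def entropy(label_counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(label_counts.values())
    return -sum(
        (n / total) * math.log2(n / total) for n in label_counts.values() if n
    )


def information_gain(counts):
    """Information gain of a split, computed from aggregated counts only."""
    overall = {}
    for label_counts in counts.values():
        for label, n in label_counts.items():
            overall[label] = overall.get(label, 0) + n
    total = sum(overall.values())
    remainder = sum(
        sum(lc.values()) / total * entropy(lc) for lc in counts.values()
    )
    return entropy(overall) - remainder


# Hypothetical toy data; in practice the table would already live in the database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (outlook TEXT, windy TEXT, play TEXT)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [
        ("sunny", "no", "no"),
        ("sunny", "yes", "no"),
        ("overcast", "no", "yes"),
        ("overcast", "yes", "yes"),
        ("rainy", "no", "yes"),
        ("rainy", "yes", "no"),
    ],
)

# Score each candidate attribute using only the aggregates returned by the database.
gains = {
    attr: information_gain(class_counts(conn, "weather", attr, "play"))
    for attr in ("outlook", "windy")
}
best = max(gains, key=gains.get)
```

In this sketch the only data that leaves the database is one small count table per candidate attribute, which is the kind of saving the abstract alludes to when it notes that databases compute counts efficiently.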