An integrated approach for scaling up classification and prediction algorithms for data mining

Authors: Patricia E. N. Lutu
Affiliations: University of Pretoria
Venue: SAICSIT '02: Proceedings of the 2002 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists on Enablement through Technology
Year: 2002

Abstract

Classification and prediction algorithms for machine learning typically require all training data to be resident in memory during decision tree construction. In practice, a flat file is created from database or data warehouse data and loaded into memory for processing, which severely limits the scalability of these algorithms to practical data mining tasks. Researchers have made some attempts to implement disk-based algorithms that can handle much larger training sets. Both approaches suffer from three serious limitations. First, a significant amount of the original data must be duplicated on disk. Second, these algorithms are unable to utilize the computational capabilities of the data warehouse or database system, even though computing sums, counts and averages is among the operations that database and data warehouse systems perform very efficiently. Third, these algorithms produce very inflexible decision trees which cannot be manipulated by the analyst, because they fail to take advantage of those features of data cube technology that enable analysts to view data at different levels of abstraction. This paper proposes a data mining approach that removes the need to copy data from a data warehouse or database. The approach also facilitates On-line Analytical Mining (OLAM), as it integrates database and data warehouse queries with decision tree construction.
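
To illustrate the general idea of letting the database compute the counts that decision tree construction needs, the following is a minimal sketch and not the method proposed in the paper. It assumes a SQLite database accessed through Python's standard sqlite3 module; the table `customers`, its columns, the sample rows, and the helper functions `class_counts_by_value`, `entropy` and `information_gain` are all hypothetical names introduced for this example. A GROUP BY query returns one row per (attribute value, class) pair, so only the aggregated counts reach the client rather than the training records themselves.

```python
import math
import sqlite3


def class_counts_by_value(conn, table, attribute, label):
    """Ask the database for class counts per attribute value (hypothetical helper)."""
    # Identifiers are interpolated into the SQL text, so they must come from
    # trusted code, never from user input.
    query = (
        f"SELECT {attribute}, {label}, COUNT(*) "
        f"FROM {table} GROUP BY {attribute}, {label}"
    )
    counts = {}
    for value, cls, n in conn.execute(query):
        counts.setdefault(value, {})[cls] = n
    return counts


def entropy(class_counts):
    """Shannon entropy (in bits) of a {class: count} distribution."""
    total = sum(class_counts.values())
    return -sum(
        (n / total) * math.log2(n / total)
        for n in class_counts.values()
        if n > 0
    )


def information_gain(counts):
    """Information gain of a split, computed from aggregated counts only."""
    # Rebuild the overall class distribution from the per-value counts.
    overall = {}
    for per_class in counts.values():
        for cls, n in per_class.items():
            overall[cls] = overall.get(cls, 0) + n
    total = sum(overall.values())
    # Weighted entropy of the partitions induced by the attribute.
    remainder = sum(
        (sum(per_class.values()) / total) * entropy(per_class)
        for per_class in counts.values()
    )
    return entropy(overall) - remainder


# Tiny made-up data set, used only to exercise the functions above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (income TEXT, student TEXT, buys TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("high", "no", "no"),
        ("high", "yes", "yes"),
        ("low", "yes", "yes"),
        ("low", "no", "no"),
    ],
)

gain_student = information_gain(
    class_counts_by_value(conn, "customers", "student", "buys")
)
gain_income = information_gain(
    class_counts_by_value(conn, "customers", "income", "buys")
)
```

In this made-up data set the `student` column separates the classes perfectly while `income` carries no information, so a tree builder choosing by information gain would split on `student` first. The design point is that the query result grows with the number of distinct attribute values and classes, not with the number of training rows.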