RainForest—A Framework for Fast Decision Tree Construction of Large Datasets

Authors:
Johannes Gehrke;Raghu Ramakrishnan;Venkatesh Ganti
Affiliations:
Department of Computer Sciences, University of Wisconsin-Madison;Department of Computer Sciences, University of Wisconsin-Madison;Department of Computer Sciences, University of Wisconsin-Madison
Venue:
Data Mining and Knowledge Discovery
Year:
2000

Abstract

Classification of large datasets is an important data mining problem. Many classification algorithms have been proposed in the literature, but studies have shown that so far no algorithm uniformly outperforms all others in terms of quality. In this paper, we present a unifying framework called RainForest for classification tree construction that separates the scalability aspects of tree construction algorithms from the central features that determine the quality of the tree. The generic algorithm is easy to instantiate with specific split selection methods from the literature (including C4.5, CART, CHAID, FACT, ID3 and extensions, SLIQ, SPRINT and QUEST). In addition to its generality, in that it yields scalable versions of a wide range of classification algorithms, our approach also offers performance improvements of more than a factor of three over SPRINT, the fastest scalable classification algorithm proposed previously. In contrast to SPRINT, however, our generic algorithm requires a certain minimum amount of main memory, proportional to the number of distinct values in a column of the input relation. Given current main memory costs, this requirement is readily met in most if not all workloads.
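
The separation described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes categorical attributes, a single tree node, and data that fits in a Python list. One scan builds a count table per attribute, whose size grows with the number of distinct values in that column, and a pluggable selector then picks the split from those tables alone. The names `build_count_tables`, `gini_split`, `choose_split`, and the sample `ROWS` are hypothetical and introduced only for this illustration.

```python
from collections import Counter, defaultdict


def build_count_tables(rows, attributes, label):
    """Scan the rows once; for each attribute, count (value, class) pairs.

    Each table has one entry per distinct attribute value, so its size
    depends on the number of distinct values, not on the number of rows.
    """
    tables = {a: defaultdict(Counter) for a in attributes}
    for row in rows:
        for a in attributes:
            tables[a][row[a]][row[label]] += 1
    return tables


def gini(class_counts):
    """Gini impurity of one partition, given its class counts."""
    n = sum(class_counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in class_counts.values())


def gini_split(tables):
    """Illustrative selector (an assumption, not a method from the paper):
    choose the attribute whose multiway split on its values has the
    lowest weighted Gini impurity."""
    best_attr, best_score = None, float("inf")
    for attr, table in tables.items():
        total = sum(sum(c.values()) for c in table.values())
        score = sum(sum(c.values()) / total * gini(c) for c in table.values())
        if score < best_score:
            best_attr, best_score = attr, score
    return best_attr, best_score


def choose_split(rows, attributes, label, selector=gini_split):
    """Scalable part (counting) and quality part (selector) kept separate."""
    return selector(build_count_tables(rows, attributes, label))


# Hypothetical toy data, used only to exercise the sketch.
ROWS = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]

if __name__ == "__main__":
    print(choose_split(ROWS, ["outlook", "windy"], "play"))
```

On the toy rows, `outlook` separates the two classes perfectly, so the selector chooses it. Swapping in a different `selector` function changes the quality criterion without touching the counting pass, which is the kind of decoupling the abstract describes.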