BOAT—optimistic decision tree construction

Authors:
Johannes Gehrke;Venkatesh Ganti;Raghu Ramakrishnan;Wei-Yin Loh
Affiliations:
Department of Computer Sciences and Department of Statistics, University of Wisconsin-Madison;Department of Computer Sciences and Department of Statistics, University of Wisconsin-Madison;Department of Computer Sciences and Department of Statistics, University of Wisconsin-Madison;Department of Computer Sciences and Department of Statistics, University of Wisconsin-Madison
Venue:
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Year:
1999

Abstract

Classification is an important data mining problem. Given a training database of records, each tagged with a class label, the goal of classification is to build a concise model that can be used to predict the class label of future, unlabeled records. Decision trees are a very popular class of classifiers. All current algorithms to construct decision trees, including all main-memory algorithms, make one scan over the training database per level of the tree.

We introduce a new algorithm (BOAT) for decision tree construction that improves upon earlier algorithms in both performance and functionality. BOAT constructs several levels of the tree in only two scans over the training database, resulting in an average performance gain of 300% over previous work. The key to this performance improvement is a novel optimistic approach to tree construction in which we construct an initial tree using a small subset of the data and refine it to arrive at the final tree. We guarantee that any difference with respect to the “real” tree (i.e., the tree that would be constructed by examining all the data in a traditional way) is detected and corrected. The correction step occasionally requires additional scans over subsets of the data; in practice, this situation rarely arises and can be addressed with little added cost.

Beyond offering faster tree construction, BOAT is the first scalable algorithm that can incrementally update the tree with respect to both insertions and deletions over the dataset. This property is valuable in dynamic environments such as data warehouses, in which the training dataset changes over time. The BOAT update operation is much cheaper than completely rebuilding the tree, and the resulting tree is guaranteed to be identical to the tree that would be produced by a complete rebuild.
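
The abstract does not spell out how the optimistic step works, so the following is only a toy illustration of the build-from-a-sample, then detect-and-correct idea, not the paper's procedure. It handles a single numeric attribute and a single split. The function names (`best_split`, `coarse_interval`, `optimistic_split`), the use of Gini impurity, and the use of bootstrap resampling to obtain a coarse interval are all assumptions made for this sketch. The verification here simply recomputes the split on all records, which stands in for the full-data pass.

```python
import random
from collections import Counter


def gini(counts, total):
    """Gini impurity of a class-count mapping covering `total` records."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_split(records):
    """Return the threshold t minimizing weighted Gini for the test x <= t.

    `records` is a list of (x, label) pairs. Returns None when there are
    fewer than two distinct x values.
    """
    ordered = sorted(records)
    n = len(ordered)
    right = Counter(label for _, label in ordered)
    left = Counter()
    best_t, best_score = None, float("inf")
    for i, (x, label) in enumerate(ordered[:-1]):
        left[label] += 1
        right[label] -= 1
        if ordered[i + 1][0] == x:
            continue  # only split between distinct attribute values
        n_left = i + 1
        n_right = n - n_left
        score = (n_left * gini(left, n_left) + n_right * gini(right, n_right)) / n
        if score < best_score:
            best_t, best_score = x, score
    return best_t


def coarse_interval(sample, rounds, rng):
    """Hypothetical helper: range of best splits over bootstrap resamples."""
    splits = []
    for _ in range(rounds):
        boot = [rng.choice(sample) for _ in sample]
        t = best_split(boot)
        if t is not None:
            splits.append(t)
    if not splits:
        raise ValueError("sample has no usable split")
    return min(splits), max(splits)


def optimistic_split(records, sample_size=100, rounds=20, seed=0):
    """Guess a split interval from a sample, then verify against all records.

    Returns (exact_split, (lo, hi), corrected), where `corrected` is True
    when the exact split fell outside the optimistic interval.
    """
    rng = random.Random(seed)
    sample = rng.sample(records, sample_size)
    lo, hi = coarse_interval(sample, rounds, rng)
    exact = best_split(records)  # stand-in for the verification scan
    corrected = not (lo <= exact <= hi)
    return exact, (lo, hi), corrected


# Toy data: the label flips from 0 to 1 once x exceeds 500.
records = [(x, int(x > 500)) for x in range(1000)]
exact, interval, corrected = optimistic_split(records)
print(exact, interval, corrected)
```

Whatever sample is drawn, the returned split for this toy data is 500, because the final answer always comes from the check against all records; the sample only determines whether a correction was needed.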