Handling numeric attributes in Hoeffding trees

Authors:
Bernhard Pfahringer;Geoffrey Holmes;Richard Kirkby
Affiliations:
University of Waikato, Hamilton, New Zealand;University of Waikato, Hamilton, New Zealand;University of Waikato, Hamilton, New Zealand
Venue:
PAKDD'08 Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining
Year:
2008

References (12):
Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS)
The P2 algorithm for dynamic calculation of quantiles and histograms without storing observations. Communications of the ACM
Approximate medians and other quantiles in one pass and with limited memory. SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Mining high-speed data streams. Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Computing standard deviations: accuracy. Communications of the ACM
Space-efficient online computation of quantile summaries. SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Database Mining: A Performance Perspective. IEEE Transactions on Knowledge and Data Engineering
A One-Pass Algorithm for Accurately Estimating Quantiles for Disk-Resident Data. VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
Accurate decision trees for mining high-speed data streams. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Forest trees for on-line data. Proceedings of the 2004 ACM symposium on Applied computing
Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research
Stress-testing Hoeffding trees. PKDD'05 Proceedings of the 9th European conference on Principles and Practice of Knowledge Discovery in Databases

Cited by (8):
Decision Tree Induction from Numeric Data Stream. AI '08 Proceedings of the 21st Australasian Joint Conference on Artificial Intelligence: Advances in Artificial Intelligence
MOA: Massive Online Analysis. The Journal of Machine Learning Research
Learning model trees from evolving data streams. Data Mining and Knowledge Discovery
Kernel-based selective ensemble learning for streams of trees. IJCAI'11 Proceedings of the Twenty-Second international joint conference on Artificial Intelligence - Volume Two
Learning very fast decision tree from uncertain data streams with positive and unlabeled samples. Information Sciences: an International Journal
Learning from data streams with only positive and unlabeled data. Journal of Intelligent Information Systems
A lossy counting based approach for learning on streams of graphs on a budget. IJCAI'13 Proceedings of the Twenty-Third international joint conference on Artificial Intelligence
Data stream mining for predicting software build outcomes using source code metrics. Information and Software Technology

Abstract

For conventional machine learning classification algorithms, handling numeric attributes is relatively straightforward. Unsupervised and supervised solutions exist that either segment the data into predefined bins or sort the data and search for the best split points. Unfortunately, none of these solutions carries over particularly well to a data stream environment. Solutions for data streams have been proposed by several authors, but as yet none have been compared empirically. In this paper we investigate a range of methods for multi-class tree-based classification where the handling of numeric attributes takes place as the tree is constructed. To this end, we extend an existing approach based on simple Gaussian approximation. We then compare this method with four approaches from the literature, arriving at eight final algorithm configurations for testing. The solutions cover a range of options from perfectly accurate and memory intensive to highly approximate. All methods are tested using the Hoeffding tree classification algorithm. Surprisingly, the experimental comparison shows that the most approximate methods produce the most accurate trees by allowing for faster tree growth.
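
To make the idea of a Gaussian approximation concrete, the following is a minimal sketch of how a tree leaf might summarize one numeric attribute on a stream: one running mean and variance per class, from which approximate class counts on either side of a candidate threshold can be estimated. It is not taken from the paper. The class and method names (`GaussianEstimator`, `NumericAttributeObserver`, `best_split`), the evenly spaced candidate thresholds between the observed minimum and maximum, and the use of information gain as the split criterion are all assumptions made for illustration.

```python
import math
from collections import defaultdict


class GaussianEstimator:
    """Running mean and variance for one class (Welford's update)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def count_below(self, t):
        """Estimated number of seen values <= t under a normal fit."""
        s = self.std()
        if s == 0.0:
            return float(self.n) if self.mean <= t else 0.0
        z = (t - self.mean) / (s * math.sqrt(2.0))
        return self.n * 0.5 * (1.0 + math.erf(z))


def entropy(counts):
    """Shannon entropy (bits) of a list of non-negative class weights."""
    total = sum(counts)
    if total <= 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


class NumericAttributeObserver:
    """Hypothetical per-leaf summary of a single numeric attribute."""

    def __init__(self):
        self.per_class = defaultdict(GaussianEstimator)
        self.lo = math.inf
        self.hi = -math.inf

    def observe(self, value, label):
        # Constant memory per class: no raw values are stored.
        self.per_class[label].add(value)
        self.lo = min(self.lo, value)
        self.hi = max(self.hi, value)

    def best_split(self, num_candidates=10):
        """Return (threshold, estimated information gain)."""
        ests = list(self.per_class.values())
        totals = [e.n for e in ests]
        n = sum(totals)
        if n == 0:
            return (None, 0.0)
        base = entropy(totals)
        best = (None, 0.0)
        # Assumption: candidates are evenly spaced inside (lo, hi).
        for i in range(1, num_candidates + 1):
            t = self.lo + (self.hi - self.lo) * i / (num_candidates + 1)
            left = [e.count_below(t) for e in ests]
            right = [tot - l for tot, l in zip(totals, left)]
            nl, nr = sum(left), sum(right)
            gain = base - (nl / n) * entropy(left) - (nr / n) * entropy(right)
            if gain > best[1]:
                best = (t, gain)
        return best
```

Under these assumptions, each class costs three numbers regardless of how many examples arrive, which illustrates the "highly approximate" end of the accuracy and memory range the abstract describes; an exact method would instead need to retain or sort the observed values.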