The P2 algorithm for dynamic calculation of quantiles and histograms without storing observations
Communications of the ACM
Approximate medians and other quantiles in one pass and with limited memory
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
BOAT—optimistic decision tree construction
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
An on-line agglomerative clustering method for nonstationary data
Neural Computation
On the boosting ability of top-down decision tree learning algorithms
Journal of Computer and System Sciences
Space-efficient online computation of quantile summaries
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Parallel Formulations of Decision-Tree Classification Algorithms
Data Mining and Knowledge Discovery
SLIQ: A Fast Scalable Classifier for Data Mining
EDBT '96 Proceedings of the 5th International Conference on Extending Database Technology: Advances in Database Technology
Parallel Out-of-Core Divide-and-Conquer Techniques with Application to Classification Trees
IPPS '99/SPDP '99 Proceedings of the 13th International Symposium on Parallel Processing and the 10th Symposium on Parallel and Distributed Processing
SPRINT: A Scalable Parallel Classifier for Data Mining
VLDB '96 Proceedings of the 22nd International Conference on Very Large Data Bases
ScalParC: A New Scalable and Efficient Parallel Classification Algorithm for Mining Large Datasets
IPPS '98 Proceedings of the 12th International Parallel Processing Symposium
An improved data stream summary: the count-min sketch and its applications
Journal of Algorithms
Approximation and streaming algorithms for histogram construction problems
ACM Transactions on Database Systems (TODS)
Statistical Comparisons of Classifiers over Multiple Data Sets
The Journal of Machine Learning Research
Continuously maintaining order statistics over data streams: extended abstract
ADC '07 Proceedings of the eighteenth conference on Australasian database - Volume 63
How to summarize the universe: dynamic maintenance of quantiles
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
The history of histograms (abridged)
VLDB '03 Proceedings of the 29th International Conference on Very Large Data Bases - Volume 29
Parallel boosted regression trees for web search ranking
Proceedings of the 20th International Conference on World Wide Web
Scalable random forests for massive data
PAKDD'12 Proceedings of the 16th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining - Volume Part I
Hierarchical linear support vector machine
Pattern Recognition
Unexpected challenges in large scale machine learning
Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Parallel approaches to machine learning - A comprehensive survey
Journal of Parallel and Distributed Computing
Information Sciences: an International Journal
Building fast decision trees from large training sets
Intelligent Data Analysis
We propose a new algorithm for building decision tree classifiers. The algorithm runs in a distributed environment and is designed specifically for classifying large data sets and streaming data. We show empirically that it is as accurate as a standard decision tree classifier while scaling to streaming data on multiple processors, and we support these findings with a rigorous analysis of the algorithm's accuracy. The essence of the algorithm is that each processor quickly constructs histograms that compress its share of the data into a fixed amount of memory. A master processor uses this compressed information to find near-optimal split points for terminal tree nodes. Our analysis shows that guarantees on the local accuracy of split points imply guarantees on the accuracy of the overall tree.
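The fixed-memory compression the abstract describes can be illustrated with a streaming histogram that keeps at most a fixed number of (centroid, count) bins, fusing the two closest bins whenever the cap is exceeded, and that supports merging worker histograms at a master. The sketch below is a minimal illustration of that idea under these assumptions; the class and method names are invented for this example, not taken from the authors' implementation.

```python
# Minimal sketch of a fixed-memory streaming histogram: each processor
# compresses its portion of the stream into at most `max_bins`
# (centroid, count) pairs; a master can merge worker histograms.
# Illustrative only -- not the authors' actual data structure.
import bisect

class StreamingHistogram:
    def __init__(self, max_bins=8):
        self.max_bins = max_bins
        self.bins = []  # sorted list of [centroid, count]

    def update(self, x):
        """Absorb one observation, then shrink back to max_bins."""
        i = bisect.bisect_left(self.bins, [x, 0])
        if i < len(self.bins) and self.bins[i][0] == x:
            self.bins[i][1] += 1
        else:
            self.bins.insert(i, [x, 1])
        self._trim()

    def merge(self, other):
        """Combine two histograms (e.g. worker results at the master)."""
        for c, n in other.bins:
            i = bisect.bisect_left(self.bins, [c, 0])
            self.bins.insert(i, [c, n])
        self._trim()

    def _trim(self):
        # Repeatedly fuse the two closest bins into their weighted mean,
        # so memory stays bounded regardless of stream length.
        while len(self.bins) > self.max_bins:
            gaps = [self.bins[i + 1][0] - self.bins[i][0]
                    for i in range(len(self.bins) - 1)]
            i = gaps.index(min(gaps))
            (c1, n1), (c2, n2) = self.bins[i], self.bins[i + 1]
            self.bins[i:i + 2] = [[(c1 * n1 + c2 * n2) / (n1 + n2),
                                   n1 + n2]]

    def total(self):
        return sum(n for _, n in self.bins)
```

Because fusing bins preserves total counts, each worker's memory stays constant while the merged summary still lets the master estimate the class distribution on either side of a candidate split point.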