The existence of massive datasets raises the need for algorithms that make efficient use of resources such as memory and computation time. Besides well-known approaches like sampling, online algorithms are increasingly recognized as good alternatives, since they often process datasets faster and with far less memory. The class of algorithms that learn linear model trees online (incremental linear model trees, or ILMTs in the following) offers a promising option for regression tasks in this setting. Surprisingly little is known about their performance, however, as no large-scale evaluation on massive stationary datasets under equal conditions exists. This paper therefore investigates their applicability to massive stationary datasets under various parameter settings. To reduce biases arising from the choice of programming language or from programming skill, all algorithms were reimplemented within the same framework and tested under identical conditions. Results on real-world datasets indicate that for massive stationary datasets, parameter settings that lead to complex models do not pay off: they yield at most a small accuracy gain at a much higher running time. The experimental evidence suggests that simple and fast algorithms perform best.
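To make concrete what "incremental" means here, the following is a minimal Python sketch of the kind of per-leaf linear model an ILMT maintains: each training example is seen once, used for a constant-time update, and then discarded, so memory stays constant regardless of dataset size. This is purely illustrative and not a reimplementation of any evaluated algorithm; the class name, learning rate, and SGD-style update rule are assumptions for the sketch.

    # Illustrative sketch only: an online linear model updated by
    # stochastic gradient descent on squared error, as might sit in
    # the leaf of an incremental linear model tree.
    class OnlineLinearModel:
        def __init__(self, n_features, lr=0.01):
            self.w = [0.0] * n_features   # weight per feature
            self.b = 0.0                  # intercept
            self.lr = lr                  # learning rate (assumed)

        def predict(self, x):
            return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

        def update(self, x, y):
            # One gradient step on this example's squared error;
            # the example is then discarded (single-pass learning).
            err = self.predict(x) - y
            self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
            self.b -= self.lr * err

    # Usage: stream examples through the model one at a time.
    model = OnlineLinearModel(n_features=2)
    for x, y in [([1.0, 2.0], 5.0), ([2.0, 0.5], 3.0), ([0.0, 1.0], 2.0)]:
        model.update(x, y)
    print(model.predict([1.0, 1.0]))

An actual ILMT additionally maintains split statistics at internal nodes and grows the tree as examples arrive; the sketch only conveys the constant-memory, single-pass update style that distinguishes online algorithms from batch learners.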