The existence of massive datasets raises the need for algorithms that make efficient use of resources such as memory and computation time. Besides well-known approaches like sampling, online algorithms are increasingly recognized as good alternatives, since they often process datasets faster and with far less memory. The class of algorithms that learn linear model trees online (incremental linear model trees, or ILMTs in the following) offers a promising option for regression tasks in this setting. Surprisingly little is known about their performance, however, as no large-scale evaluation on massive stationary datasets under equal conditions exists. This paper therefore investigates their applicability to massive stationary datasets under various parameter settings. To reduce biases arising from the choice of programming language or from programming skill, all algorithms were reimplemented within the same framework and tested under identical conditions. Results on real-world datasets indicate that for massive stationary datasets, parameter settings that lead to complex models do not pay off: they yield at most a small accuracy gain at a much higher running time. The experimental evidence suggests that simple and fast algorithms perform best.
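To make concrete what "incremental" means here, the following is a minimal Python sketch of the kind of per-leaf linear model an ILMT maintains: each training example is seen once, used for a constant-time update, and then discarded, so memory stays constant regardless of dataset size. This is purely illustrative and not a reimplementation of any evaluated algorithm; the class name, learning rate, and SGD-style update rule are assumptions for the sketch.

    # Illustrative sketch only: an online linear model updated by
    # stochastic gradient descent on squared error, as might sit in
    # the leaf of an incremental linear model tree.
    class OnlineLinearModel:
        def __init__(self, n_features, lr=0.01):
            self.w = [0.0] * n_features   # weight per feature
            self.b = 0.0                  # intercept
            self.lr = lr                  # learning rate (assumed)

        def predict(self, x):
            return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

        def update(self, x, y):
            # One gradient step on this example's squared error;
            # the example is then discarded (single-pass learning).
            err = self.predict(x) - y
            self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
            self.b -= self.lr * err

    # Usage: stream examples through the model one at a time.
    model = OnlineLinearModel(n_features=2)
    for x, y in [([1.0, 2.0], 5.0), ([2.0, 0.5], 3.0), ([0.0, 1.0], 2.0)]:
        model.update(x, y)
    print(model.predict([1.0, 1.0]))

An actual ILMT additionally maintains split statistics at internal nodes and grows the tree as examples arrive; the sketch only conveys the constant-memory, single-pass update style that distinguishes online algorithms from batch learners.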