SEER: maximum likelihood regression for learning-speed curves
We examine the learning-curve sampling method, an approach for applying machine-learning algorithms to large data sets. The approach is based on the observation that the computational cost of learning a model grows with the sample size of the training data, whereas model accuracy improves with diminishing returns as the sample size increases. The learning-curve sampling method therefore monitors cost and performance as progressively larger samples are used for training, and terminates learning when the expected future costs outweigh the expected future benefits. In this paper, we formalize the learning-curve sampling method and its associated cost-benefit tradeoff in terms of decision theory. In addition, we describe its application to model-based clustering via the expectation-maximization (EM) algorithm. In experiments on three real data sets, we show that the learning-curve sampling method produces models that are nearly as accurate as those trained on the complete data sets, but with dramatically reduced learning times. Finally, we describe an extension of the basic learning-curve approach for model-based clustering that yields an additional speedup. This extension is based on the observation that the shape of the learning curve for a given model and data set is roughly independent of the number of EM iterations used during training. Thus, we run EM for only a few iterations to decide how many cases to use for training, and then run EM to full convergence once the number of cases is selected.
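The cost-benefit stopping rule can be sketched as follows. This is a minimal illustration, not the paper's decision-theoretic formulation: it uses a greedy one-step lookahead in which the observed accuracy gain of the latest sample-size increase stands in for the expected future benefit. The names `train`, `evaluate`, `schedule`, `cost_per_case`, and `util_per_acc` are hypothetical, introduced here for illustration only.

```python
def learning_curve_sampling(train, evaluate, schedule, cost_per_case, util_per_acc):
    """Grow the training sample along `schedule` (increasing sample sizes)
    and stop when the marginal benefit of the latest step no longer
    exceeds its marginal cost.

    Illustrative sketch only: the paper frames the tradeoff decision-
    theoretically; here the benefit of a step is simply the observed
    accuracy gain scaled by `util_per_acc`, and its cost is the number
    of added cases scaled by `cost_per_case`.
    """
    best_model, prev_acc = None, None
    for i, n in enumerate(schedule):
        model = train(n)          # fit on a sample of n cases
        acc = evaluate(model)     # held-out performance of the fit
        if prev_acc is not None:
            gain = util_per_acc * (acc - prev_acc)            # benefit of this step
            step_cost = cost_per_case * (n - schedule[i - 1])  # cost of this step
            if gain <= step_cost:
                # Future steps are expected to help even less: stop and
                # keep the model from the previous (smaller) sample.
                return best_model, schedule[i - 1]
        best_model, prev_acc = model, acc
    return best_model, schedule[-1]
```

As a toy usage, suppose accuracy saturates with sample size as `1 - 50/(n + 50)`; with a doubling schedule, the rule stops once the accuracy gain of a doubling is worth less than the added training cost, returning the model fit on the last sample size whose step still paid for itself.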