SEER: maximum likelihood regression for learning-speed curves
We examine the learning-curve sampling method, an approach for applying machine-learning algorithms to large data sets. The approach is based on the observation that the computational cost of learning a model grows with the sample size of the training data, whereas model accuracy improves with diminishing returns as the sample size increases. The learning-curve sampling method therefore monitors cost and performance as progressively larger samples are used for training, and terminates learning when the expected future costs outweigh the expected future benefits. In this paper, we formalize the learning-curve sampling method and its associated cost-benefit tradeoff in terms of decision theory. In addition, we describe its application to model-based clustering via the expectation-maximization (EM) algorithm. In experiments on three real data sets, we show that the learning-curve sampling method produces models that are nearly as accurate as those trained on the complete data sets, but with dramatically reduced learning times. Finally, we describe an extension of the basic learning-curve approach for model-based clustering that yields an additional speedup. This extension is based on the observation that the shape of the learning curve for a given model and data set is roughly independent of the number of EM iterations used during training. Thus, we run EM for only a few iterations to decide how many cases to use for training, and then run EM to full convergence once the number of cases is selected.
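The cost-benefit stopping rule can be sketched as follows. This is a minimal illustration, not the paper's decision-theoretic formulation: it uses a greedy one-step lookahead in which the observed accuracy gain of the latest sample-size increase stands in for the expected future benefit. The names `train`, `evaluate`, `schedule`, `cost_per_case`, and `util_per_acc` are hypothetical, introduced here for illustration only.

```python
def learning_curve_sampling(train, evaluate, schedule, cost_per_case, util_per_acc):
    """Grow the training sample along `schedule` (increasing sample sizes)
    and stop when the marginal benefit of the latest step no longer
    exceeds its marginal cost.

    Illustrative sketch only: the paper frames the tradeoff decision-
    theoretically; here the benefit of a step is simply the observed
    accuracy gain scaled by `util_per_acc`, and its cost is the number
    of added cases scaled by `cost_per_case`.
    """
    best_model, prev_acc = None, None
    for i, n in enumerate(schedule):
        model = train(n)          # fit on a sample of n cases
        acc = evaluate(model)     # held-out performance of the fit
        if prev_acc is not None:
            gain = util_per_acc * (acc - prev_acc)            # benefit of this step
            step_cost = cost_per_case * (n - schedule[i - 1])  # cost of this step
            if gain <= step_cost:
                # Future steps are expected to help even less: stop and
                # keep the model from the previous (smaller) sample.
                return best_model, schedule[i - 1]
        best_model, prev_acc = model, acc
    return best_model, schedule[-1]
```

As a toy usage, suppose accuracy saturates with sample size as `1 - 50/(n + 50)`; with a doubling schedule, the rule stops once the accuracy gain of a doubling is worth less than the added training cost, returning the model fit on the last sample size whose step still paid for itself.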