Efficiently Determining the Starting Sample Size for Progressive Sampling

Authors:
Baohua Gu; Bing Liu; Feifang Hu; Huan Liu
Venue:
ECML '01 Proceedings of the 12th European Conference on Machine Learning
Year:
2001

Abstract

Given a large data set and a classification learning algorithm, Progressive Sampling (PS) learns from increasingly larger random samples until model accuracy no longer improves. The technique has been shown to be remarkably efficient compared with learning from the entire data set. However, how to set the starting sample size for PS remains an open problem. We show that an improper starting sample size can still make PS computationally expensive, for two reasons: the learning algorithm is run on a large total number of instances across the sequence of random samples drawn before convergence, and fetching the sample data requires excessive database scans. A suitable starting sample size can therefore further improve the efficiency of PS. In this paper, we present a statistical approach that efficiently finds such a size. We call it the Statistical Optimal Sample Size (SOSS), in the sense that a sample of this size sufficiently resembles the entire data set. We introduce an information-based measure of this resemblance, Sample Quality, to define the SOSS, and show that it can be obtained efficiently in one scan of the data. We prove that learning on a sample of size SOSS produces model accuracy that asymptotically approaches the highest accuracy achievable on the entire data set. Empirical results on a number of large data sets from the UCI KDD repository show that SOSS is a suitable starting size for Progressive Sampling.
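
The two ideas in the abstract, a resemblance score that picks the starting size and a progressive loop that grows the sample until accuracy stops improving, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the paper's definition: the abstract only says Sample Quality is "information-based", so the score below uses exp(-mean Kullback-Leibler divergence) over categorical attributes as a stand-in; the function names, the 0.99 threshold, the geometric schedule, and the convergence tolerance are all hypothetical. The sketch also does not attempt the paper's one-scan computation of the starting size; it only counts the full data once and then draws repeated samples.

```python
import math
import random
from collections import Counter


def column_counts(rows):
    """Value counts for each attribute (column) of a list of equal-length tuples."""
    return [Counter(col) for col in zip(*rows)]


def sample_quality(sample, full_counts, n_full):
    """Illustrative resemblance score in (0, 1]; NOT the paper's definition.

    Computes exp(-mean over attributes of KL(sample || full data)).
    Assumes the sample is drawn from the data, so every sampled value
    has a nonzero count in the full data.
    """
    total = 0.0
    for s_counts, d_counts in zip(column_counts(sample), full_counts):
        for value, count in s_counts.items():
            p = count / len(sample)
            q = d_counts[value] / n_full
            total += p * math.log(p / q)
    # Clamp tiny negative values caused by floating-point rounding.
    return math.exp(-max(0.0, total / len(full_counts)))


def starting_size(data, threshold=0.99, first=50, factor=2, seed=0):
    """Smallest size on a geometric grid whose random sample meets the threshold."""
    rng = random.Random(seed)
    full_counts = column_counts(data)  # one pass over the full data
    n = first
    while n < len(data):
        if sample_quality(rng.sample(data, n), full_counts, len(data)) >= threshold:
            return n
        n *= factor
    return len(data)


def progressive_sampling(data, evaluate, start, factor=2, tol=0.001, seed=0):
    """Grow the sample from `start` until accuracy gains fall to `tol` or below.

    `evaluate(sample) -> accuracy` is supplied by the caller (train and test
    a learner on the sample). Returns the list of (sample_size, accuracy).
    """
    rng = random.Random(seed)
    n = min(start, len(data))
    history = []
    while True:
        acc = evaluate(rng.sample(data, n))
        history.append((n, acc))
        converged = len(history) > 1 and acc - history[-2][1] <= tol
        if converged or n == len(data):
            return history
        n = min(n * factor, len(data))
```

In use, a caller would pass `start=starting_size(data)` to `progressive_sampling` together with an `evaluate` function wrapping the chosen learner. Starting closer to the converged size means fewer iterations, which is the efficiency argument the abstract makes.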