Given a large data set and a classification learning algorithm, Progressive Sampling (PS) learns from increasingly larger random samples until model accuracy no longer improves. Prior work has shown that this technique is remarkably efficient compared with learning from the entire data set. However, how to set the starting sample size for PS remains an open problem. We show that an ill-chosen starting size can still make PS computationally expensive, because the learning algorithm must be run on a large number of instances (drawn from a sequence of random samples before convergence is reached) and the database must be scanned repeatedly to fetch the sample data. A suitable starting sample size can therefore further improve the efficiency of PS. In this paper, we present a statistical approach that efficiently finds such a size. We call it the Statistical Optimal Sample Size (SOSS), in the sense that a sample of this size sufficiently resembles the entire data set. We introduce an information-based measure of this resemblance (Sample Quality) to define the SOSS and show that it can be computed in a single scan of the data. We prove that learning from a sample of size SOSS produces model accuracy that asymptotically approaches the highest accuracy achievable on the entire data set. Empirical results on a number of large data sets from the UCI KDD repository show that SOSS is a suitable starting size for Progressive Sampling.
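To make the idea concrete, the sketch below is a minimal, hypothetical Python illustration, not the paper's implementation. It assumes categorical attributes stored in a 2-D array, measures sample quality as one minus the average Jensen-Shannon divergence between per-attribute value distributions of the sample and the full data (a stand-in for the paper's information-based Sample Quality measure), and then uses the smallest size whose quality passes a threshold as the starting size for a generic progressive-sampling loop. The function names (sample_quality, find_soss, progressive_sampling), the divergence choice, and the threshold are all illustrative assumptions.

```python
# Illustrative sketch only: the quality measure, names, and thresholds below
# are assumptions, not the paper's exact definitions.
import numpy as np
from collections import Counter


def attribute_distribution(column):
    """Relative frequency of each value in one (categorical) attribute."""
    counts = Counter(column)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (in bits, between 0 and 1) of two discrete
    distributions given as {value: probability} dictionaries."""
    keys = sorted(set(p) | set(q), key=str)
    p_vec = np.array([p.get(k, 0.0) for k in keys])
    q_vec = np.array([q.get(k, 0.0) for k in keys])
    m = 0.5 * (p_vec + q_vec)

    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p_vec, m) + 0.5 * kl(q_vec, m)


def sample_quality(sample, data):
    """Quality in [0, 1]: 1 means the sample's per-attribute value
    distributions match those of the full data exactly (averaged over
    attributes). A placeholder for the paper's Sample Quality measure."""
    divergences = [
        js_divergence(attribute_distribution(sample[:, j]),
                      attribute_distribution(data[:, j]))
        for j in range(data.shape[1])
    ]
    return 1.0 - float(np.mean(divergences))


def find_soss(data, candidate_sizes, quality_threshold=0.99, rng=None):
    """Return the smallest candidate size whose random sample already
    resembles the full data closely enough (a stand-in for SOSS)."""
    rng = rng or np.random.default_rng(0)
    for n in sorted(candidate_sizes):
        idx = rng.choice(len(data), size=n, replace=False)
        if sample_quality(data[idx], data) >= quality_threshold:
            return n
    return len(data)


def progressive_sampling(data, labels, train_and_score, start_size,
                         growth=2.0, tol=1e-3, rng=None):
    """Generic progressive-sampling loop (for illustration): grow the sample
    geometrically from start_size until accuracy improves by no more than tol.
    train_and_score is a user-supplied callback returning model accuracy."""
    rng = rng or np.random.default_rng(0)
    size, best_size, best_acc = start_size, start_size, -np.inf
    while size <= len(data):
        idx = rng.choice(len(data), size=size, replace=False)
        acc = train_and_score(data[idx], labels[idx])
        if acc - best_acc <= tol:        # convergence: accuracy has plateaued
            break
        best_size, best_acc = size, acc
        size = int(size * growth)
    return best_size, best_acc
```

In this sketch, find_soss would replace an arbitrary initial size in the usual geometric sampling schedule, so that the PS loop starts from a sample that already resembles the full data rather than from a size that forces many extra learning runs and database scans.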