In many real-world problems, such as defense and security applications, data mining algorithms have access to massive amounts of data, and mining all of it is prohibitive due to time and memory constraints. Finding the smallest training set that achieves the same accuracy as the entire available dataset therefore remains an important research question. Progressive sampling randomly selects a small initial sample and grows it according to a geometric or arithmetic series until the error converges, with the sampling schedule fixed a priori. In this paper, we explore sampling schedules that adapt to the dataset under consideration. We develop a general approach, based on the Chernoff inequality, for determining how many instances are required at each iteration for convergence. We evaluate our approach with neural networks on two real-world problems where data is abundant: face recognition and fingerprint recognition. Our empirical results show that our dynamic approach is faster and uses far fewer examples than existing methods. However, the Chernoff bound requires the samples at each iteration to be independent of one another; future work will aim to remove this limitation, which should further improve performance.
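To make the idea concrete, the sketch below shows a geometric progressive-sampling loop whose initial sample size is derived from a Chernoff/Hoeffding-style bound. It is a minimal illustration, not the authors' actual algorithm: the bound used (n >= ln(2/delta) / (2*epsilon^2)), the convergence test (error improvement below epsilon), and the `train_and_eval` callback are all simplifying assumptions introduced here.

```python
import math
import random

def chernoff_sample_size(epsilon, delta):
    """Sample size n such that an empirical mean of n i.i.d. draws lies
    within epsilon of the true mean with probability >= 1 - delta
    (Hoeffding/Chernoff form: n >= ln(2/delta) / (2 * epsilon**2))."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

def progressive_sample(data, train_and_eval, epsilon=0.05, delta=0.05, growth=2.0):
    """Geometric progressive sampling: grow the sample until the error
    estimate stops improving by more than epsilon.

    `train_and_eval(sample)` is a hypothetical callback that trains a
    model on `sample` and returns its estimated error."""
    n = chernoff_sample_size(epsilon, delta)   # initial size from the bound
    prev_err = float("inf")
    while n <= len(data):
        sample = random.sample(data, n)        # fresh sample each round
        err = train_and_eval(sample)
        if prev_err - err <= epsilon:          # error has converged
            return sample, err
        prev_err = err
        n = int(n * growth)                    # geometric schedule
    # Fall back to the full dataset if convergence was never reached.
    return data, train_and_eval(data)
```

Note that drawing a fresh random sample each round mirrors the independence assumption the abstract mentions: reusing earlier samples would invalidate the Chernoff bound.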