C4.5: programs for machine learning
The power of sampling in knowledge discovery
PODS '94 Proceedings of the thirteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
The nature of statistical learning theory
Rigorous learning curve bounds from statistical mechanics
Machine Learning - Special issue on COLT '94
Efficient progressive sampling
KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
A sequential sampling algorithm for a general class of utility criteria
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Statistical learning control of uncertain systems: theory and algorithms
Applied Mathematics and Computation
The Effects of Training Set Size on Decision Tree Complexity
ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Sampling Large Databases for Association Rules
VLDB '96 Proceedings of the 22nd International Conference on Very Large Data Bases
Journal of Artificial Intelligence Research
IJCAI'95 Proceedings of the 14th international joint conference on Artificial intelligence - Volume 2
Selective Rademacher Penalization and Reduced Error Pruning of Decision Trees
The Journal of Machine Learning Research
ISMIS'05 Proceedings of the 15th international conference on Foundations of Intelligent Systems
Sampling can speed up the processing of large training-example databases, but without seeing all of the data, or knowing the process that produces the examples, it is impossible to know in advance what sample size guarantees good performance. Progressive sampling has been suggested to circumvent this problem: the sample size is increased according to some schedule until accuracy close to that which would be obtained using all of the data is reached. Determining this stopping time efficiently and accurately is a central difficulty in progressive sampling.

We study stopping-time determination by approximating the generalization error of the hypothesis, rather than by assuming the often-observed shape for the learning curve and trying to detect whether its final plateau has been reached. We use data-dependent generalization error bounds. Instead of the common cross-validation approach, we use the recently introduced Rademacher penalties, which have been observed to give good results on simple concept classes.

We experiment with two-level decision trees built by the learning algorithm T2, which finds a hypothesis with minimal error with respect to the sample. Stopping-time determination based on the theoretically well-motivated Rademacher penalties gives results that are much closer to those attained by heuristics based on assumptions about learning-curve shape than distribution-independent estimates based on the VC dimension do.
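The combination of a geometric sampling schedule with a data-dependent stopping criterion can be sketched as follows. This is a minimal illustration, not the paper's actual procedure: `train_fn` and `error_fn` are hypothetical callbacks standing in for an empirical-error-minimizing learner such as T2, and the penalty is estimated with the label-flipping trick for such learners (fit randomly relabeled data; the better the fit, the higher the complexity penalty).

```python
import random

def rademacher_penalty(train_fn, error_fn, X, y, n_trials=5):
    """Estimate a data-dependent Rademacher penalty (illustrative sketch).

    For a learner that minimizes empirical error, the penalty can be
    estimated by relabeling the sample with random signs and measuring
    how far below 1/2 the learner can push its error on the random
    labels: fitting noise well indicates a rich hypothesis class.
    """
    total = 0.0
    for _ in range(n_trials):
        # Flip each label independently with probability 1/2.
        y_flipped = [yi if random.random() < 0.5 else 1 - yi for yi in y]
        h = train_fn(X, y_flipped)
        total += 0.5 - error_fn(h, X, y_flipped)
    # The generalization bound charges twice the average complexity.
    return 2.0 * total / n_trials

def progressive_sample(train_fn, error_fn, X, y,
                       start=100, factor=2, tol=0.01):
    """Geometric progressive sampling with a bound-based stopping rule.

    Grow the sample geometrically and stop once the data-dependent
    error bound (training error + Rademacher penalty) stops improving
    by more than `tol`, instead of guessing where the learning curve
    plateaus.
    """
    n = start
    prev_bound = float("inf")
    while n <= len(y):
        Xs, ys = X[:n], y[:n]
        h = train_fn(Xs, ys)
        bound = error_fn(h, Xs, ys) + rademacher_penalty(
            train_fn, error_fn, Xs, ys)
        if prev_bound - bound < tol:  # bound has plateaued: stop here
            return h, n
        prev_bound = bound
        n *= factor
    # Schedule exhausted the data: fall back to the full sample.
    return train_fn(X, y), len(y)
```

With a cheap learner plugged in, the stopping rule halts as soon as enlarging the sample no longer tightens the bound, which is the role the Rademacher-penalty criterion plays in place of learning-curve heuristics.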