Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms

Authors:
Carlos Domingo; Ricard Gavaldà; Osamu Watanabe
Venue:
DS '99 Proceedings of the Second International Conference on Discovery Science
Year:
1999

Abstract

Scalability is a key requirement for any KDD and data mining algorithm, and one of the biggest research challenges is to develop methods that make it possible to use large amounts of data. One possible approach for dealing with huge amounts of data is to take a random sample and do data mining on it, since approximate answers are acceptable for many data mining applications. However, as several researchers have argued, random sampling is difficult to use because it is hard to determine an appropriate sample size. In this paper, we take a sequential sampling approach to this difficulty and propose an adaptive sampling algorithm that solves a general problem covering many problems arising in applications of discovery science. The algorithm obtains examples sequentially in an on-line fashion, and it determines from the examples obtained so far whether it has already seen enough of them. Thus, the sample size is not fixed a priori; instead, it depends adaptively on the situation. Thanks to this adaptiveness, if we are not in a worst-case situation, as fortunately happens in many practical applications, then we can solve the problem with far fewer examples than are required in the worst case. To illustrate the generality of our approach, we also describe how different instantiations of it can be applied to scale up knowledge discovery problems that appear in several areas.
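
The sequential idea in the abstract can be illustrated with a small sketch. This is not the algorithm proposed in the paper, whose details the abstract does not give; it is a generic stopping rule built on Hoeffding's inequality with a union bound over steps, applied to an assumed toy task: deciding whether the mean of a stream of 0/1 examples lies above or below a threshold. The names `adaptive_threshold_test`, `draw`, `delta`, and `max_samples` are hypothetical and chosen only for illustration.

```python
import math
import random


def adaptive_threshold_test(draw, threshold=0.5, delta=0.05, max_samples=1_000_000):
    """Draw 0/1 examples one at a time until the empirical mean is clearly
    above or below `threshold`, or until `max_samples` examples were used.

    Returns (decision, samples_used), where decision is "above", "below",
    or "undecided". Assumption: examples are independent and lie in [0, 1].
    """
    total = 0
    for t in range(1, max_samples + 1):
        total += draw()
        mean = total / t
        # Hoeffding radius with failure budget delta / (t * (t + 1)) at step t.
        # These budgets sum to at most delta over all steps (union bound),
        # so a wrong decision happens with probability at most delta.
        radius = math.sqrt(math.log(2 * t * (t + 1) / delta) / (2 * t))
        if mean - radius > threshold:
            return "above", t
        if mean + radius < threshold:
            return "below", t
    return "undecided", max_samples


def bernoulli(p, rng):
    """Hypothetical example source: returns 1 with probability p, else 0."""
    return lambda: 1 if rng.random() < p else 0


if __name__ == "__main__":
    rng = random.Random(0)
    # The further p is from the threshold, the sooner the loop tends to stop.
    for p in (0.9, 0.6, 0.52):
        decision, used = adaptive_threshold_test(bernoulli(p, rng))
        print(f"p={p}: {decision} after {used} samples")
```

The sample size is not chosen in advance: the loop stops as soon as the confidence interval around the empirical mean excludes the threshold. When the true mean is far from the threshold the interval clears it after few examples, and only near-threshold (worst-case) inputs use a large part of the `max_samples` budget.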