A Scalable Constant-Memory Sampling Algorithm for Pattern Discovery in Large Databases

Authors:
Tobias Scheffer;Stefan Wrobel
Venue:
PKDD '02 Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery
Year:
2002

Abstract

Many data mining tasks can be seen as an instance of the problem of finding the most interesting (according to some utility function) patterns in a large database. In recent years, significant progress has been achieved in scaling algorithms for this task to very large databases through the use of sequential sampling techniques. However, except for sampling-based greedy algorithms which cannot give absolute quality guarantees, the scalability of existing approaches to this problem is only with respect to the data, not with respect to the size of the pattern space: it is universally assumed that the entire hypothesis space fits in main memory. In this paper, we describe how this class of algorithms can be extended to hypothesis spaces that do not fit in memory while maintaining the algorithms' precise ε-δ quality guarantees. We present a constant memory algorithm for this task and prove that it possesses the required properties. In an empirical comparison, we compare variable memory and constant memory sampling.
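
As a rough illustration of what an ε-δ quality guarantee means, the sketch below computes a fixed sample size from the standard two-sided Hoeffding inequality combined with a union bound over all hypotheses. It is not the sequential or constant-memory algorithm of the paper; the function name `hoeffding_sample_size` and its parameters are assumptions introduced here, and utilities are assumed to be averages of per-record values in [0, 1].

```python
import math


def hoeffding_sample_size(num_hypotheses: int, epsilon: float, delta: float) -> int:
    """Return a sample size n such that, with probability at least 1 - delta,
    the empirical mean of every one of num_hypotheses quantities in [0, 1]
    lies within epsilon of its true mean.

    Derivation (illustrative, not taken from the paper):
    Hoeffding gives P(|estimate - truth| >= epsilon) <= 2 * exp(-2 * n * epsilon**2)
    for a single hypothesis; a union bound over all hypotheses requires
    2 * num_hypotheses * exp(-2 * n * epsilon**2) <= delta.
    """
    if num_hypotheses < 1:
        raise ValueError("num_hypotheses must be at least 1")
    if not (0.0 < epsilon < 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2 * num_hypotheses / delta) / (2 * epsilon ** 2))


if __name__ == "__main__":
    # Hypothetical setting: 1000 candidate patterns, epsilon = delta = 0.05.
    print(hoeffding_sample_size(1000, 0.05, 0.05))
```

Under these assumptions the required sample size grows only logarithmically with the number of hypotheses, whereas keeping a running estimate for every hypothesis takes memory linear in that number. This is consistent with the abstract's observation that existing approaches scale with respect to the data but assume the hypothesis space fits in main memory.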