A sequential sampling algorithm for a general class of utility criteria
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Many discovery problems, e.g., subgroup or association rule discovery, can naturally be cast as n-best hypotheses problems, where the goal is to find the n hypotheses from a given hypothesis space that score best according to a certain utility function. We present a sampling algorithm that solves this problem by issuing a small number of database queries while guaranteeing precise bounds on the confidence and quality of solutions. Known sampling approaches have treated single-hypothesis selection problems, assuming that the utility is the average, over the examples, of some instance-wise function; this is not the case for many frequently used utility functions. We show that our algorithm works for all utilities that can be estimated with bounded error. We provide these error bounds and the resulting worst-case sample bounds for some of the most frequently used utilities, and prove that no sampling algorithm exists for a popular class of utility functions that cannot be estimated with bounded error. The algorithm is sequential in the sense that it starts to return (or discard) hypotheses that already seem to be particularly good (or bad) after only a few examples. Thus, the algorithm is almost always faster than its worst-case bounds.
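The sequential accept/discard idea can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes the utility is a per-example function in [0, 1] (the simple averaged case; the paper covers a broader class of utilities with bounded estimation error) and uses a plain Hoeffding confidence interval with a union bound over the candidate hypotheses. The names `sequential_n_best` and `hoeffding_eps` are hypothetical.

```python
import math
import random

def hoeffding_eps(m, delta):
    # Hoeffding bound: with probability >= 1 - delta, the empirical mean of
    # m i.i.d. observations in [0, 1] is within eps of the true mean.
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m))

def sequential_n_best(hypotheses, utility, examples, n, delta, batch=100):
    """Return n hypotheses with (approximately) highest mean utility.

    utility(h, x) must lie in [0, 1] (a simplifying assumption).
    With probability >= 1 - delta, early accepts/discards are correct.
    """
    stats = {h: [0.0, 0] for h in hypotheses}  # h -> [utility sum, count]
    active = set(hypotheses)
    accepted = []
    # Spread the allowed error probability over all candidates (union bound).
    delta_i = delta / len(hypotheses)
    i = 0
    while len(accepted) < n and i < len(examples):
        k = n - len(accepted)
        if len(active) <= k:          # everything remaining fits in the top k
            accepted.extend(active)
            return accepted
        for x in examples[i:i + batch]:
            for h in active:
                stats[h][0] += utility(h, x)
                stats[h][1] += 1
        i += batch
        means = {h: stats[h][0] / stats[h][1] for h in active}
        eps = hoeffding_eps(i, delta_i)
        ranked = sorted(active, key=means.get, reverse=True)
        thresh_out = means[ranked[k]]      # best candidate outside the top k
        thresh_in = means[ranked[k - 1]]   # worst candidate inside the top k
        # Accept early: lower bound beats the upper bound of every outsider.
        for h in ranked[:k]:
            if means[h] - eps > thresh_out + eps:
                accepted.append(h)
                active.discard(h)
        # Discard early: upper bound is below the lower bound of the top k.
        for h in ranked[k:]:
            if means[h] + eps < thresh_in - eps:
                active.discard(h)
    # Examples exhausted: fall back to the empirically best remaining ones.
    if len(accepted) < n:
        means = {h: stats[h][0] / max(stats[h][1], 1) for h in active}
        ranked = sorted(active, key=means.get, reverse=True)
        accepted.extend(ranked[:n - len(accepted)])
    return accepted
```

A well-separated hypothesis is typically accepted after a few hundred examples rather than the full sample, which is the source of the speedup the abstract describes: the confidence radius shrinks as O(1/sqrt(m)), so clear winners and losers leave the active set long before the worst-case sample bound is reached.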