A sequential sampling algorithm for a general class of utility criteria
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Many discovery problems, e.g., subgroup or association rule discovery, can naturally be cast as n-best hypotheses problems, where the goal is to find the n hypotheses from a given hypothesis space that score best according to a certain utility function. We present a sampling algorithm that solves this problem by issuing a small number of database queries while guaranteeing precise bounds on the confidence and quality of solutions. Known sampling approaches have treated single-hypothesis selection problems, assuming that the utility is the average, over the examples, of some instance-wise function; this is not the case for many frequently used utility functions. We show that our algorithm works for all utilities that can be estimated with bounded error. We provide these error bounds and the resulting worst-case sample bounds for some of the most frequently used utilities, and prove that no sampling algorithm exists for a popular class of utility functions that cannot be estimated with bounded error. The algorithm is sequential in the sense that it starts to return (or discard) hypotheses that already seem to be particularly good (or bad) after only a few examples. Thus, the algorithm is almost always faster than its worst-case bounds.
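The sequential accept/discard idea can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes the utility is a per-example function in [0, 1] (the simple averaged case; the paper covers a broader class of utilities with bounded estimation error) and uses a plain Hoeffding confidence interval with a union bound over the candidate hypotheses. The names `sequential_n_best` and `hoeffding_eps` are hypothetical.

```python
import math
import random

def hoeffding_eps(m, delta):
    # Hoeffding bound: with probability >= 1 - delta, the empirical mean of
    # m i.i.d. observations in [0, 1] is within eps of the true mean.
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m))

def sequential_n_best(hypotheses, utility, examples, n, delta, batch=100):
    """Return n hypotheses with (approximately) highest mean utility.

    utility(h, x) must lie in [0, 1] (a simplifying assumption).
    With probability >= 1 - delta, early accepts/discards are correct.
    """
    stats = {h: [0.0, 0] for h in hypotheses}  # h -> [utility sum, count]
    active = set(hypotheses)
    accepted = []
    # Spread the allowed error probability over all candidates (union bound).
    delta_i = delta / len(hypotheses)
    i = 0
    while len(accepted) < n and i < len(examples):
        k = n - len(accepted)
        if len(active) <= k:          # everything remaining fits in the top k
            accepted.extend(active)
            return accepted
        for x in examples[i:i + batch]:
            for h in active:
                stats[h][0] += utility(h, x)
                stats[h][1] += 1
        i += batch
        means = {h: stats[h][0] / stats[h][1] for h in active}
        eps = hoeffding_eps(i, delta_i)
        ranked = sorted(active, key=means.get, reverse=True)
        thresh_out = means[ranked[k]]      # best candidate outside the top k
        thresh_in = means[ranked[k - 1]]   # worst candidate inside the top k
        # Accept early: lower bound beats the upper bound of every outsider.
        for h in ranked[:k]:
            if means[h] - eps > thresh_out + eps:
                accepted.append(h)
                active.discard(h)
        # Discard early: upper bound is below the lower bound of the top k.
        for h in ranked[k:]:
            if means[h] + eps < thresh_in - eps:
                active.discard(h)
    # Examples exhausted: fall back to the empirically best remaining ones.
    if len(accepted) < n:
        means = {h: stats[h][0] / max(stats[h][1], 1) for h in active}
        ranked = sorted(active, key=means.get, reverse=True)
        accepted.extend(ranked[:n - len(accepted)])
    return accepted
```

A well-separated hypothesis is typically accepted after a few hundred examples rather than the full sample, which is the source of the speedup the abstract describes: the confidence radius shrinks as O(1/sqrt(m)), so clear winners and losers leave the active set long before the worst-case sample bound is reached.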