The tasks of extracting (top-K) Frequent Itemsets (FI's) and Association Rules (AR's) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need to scan the entire dataset, possibly multiple times. High-quality approximations of FI's and AR's are sufficient for most practical uses, and a number of recent works have explored the application of sampling for fast discovery of approximate solutions to these problems. However, these works do not provide satisfactory performance guarantees on the quality of the approximation, due to the difficulty of bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets. In this work we circumvent this issue by applying the statistical concept of Vapnik-Chervonenkis (VC) dimension to develop a novel technique for providing tight bounds on the sample size that guarantees approximation within user-specified parameters. Our technique applies both to absolute and to relative approximations of (top-K) FI's and AR's. The resulting sample size is linearly dependent on the VC-dimension of a range space associated with the dataset to be mined. The main theoretical contribution of this work is a characterization of the VC-dimension of this range space and a proof that it is upper bounded by an easy-to-compute characteristic quantity of the dataset, which we call the d-index, namely the maximum integer d such that the dataset contains at least d transactions of length at least d. We show that this bound is strict for a large class of datasets. The resulting sample size for an absolute (resp. relative) (ε, δ)-approximation of the collection of FI's is $O(\frac{1}{\varepsilon^2}(d+\log\frac{1}{\delta}))$ (resp. $O(\frac{2+\varepsilon}{\varepsilon^2(2-\varepsilon)\theta}(d\log\frac{2+\varepsilon}{(2-\varepsilon)\theta}+\log\frac{1}{\delta}))$) transactions, which is a significant improvement over previously known results. We present an extensive experimental evaluation of our technique on real and artificial datasets, demonstrating the practicality of our methods, and showing that they achieve even higher-quality approximations than what is guaranteed by the analysis.
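The d-index defined above is easy to compute in practice (it is an h-index over transaction lengths). The sketch below, a hypothetical illustration rather than the authors' implementation, computes the d-index of a small toy dataset and plugs it into the absolute $(\varepsilon,\delta)$ sample-size bound; the constant `c` hidden in the O-notation is an assumption chosen purely for illustration, and the toy transactions are invented.

```python
import math

def d_index(dataset):
    """Maximum integer d such that the dataset contains at least
    d transactions of length at least d (an h-index over lengths)."""
    lengths = sorted((len(t) for t in dataset), reverse=True)
    d = 0
    for i, ell in enumerate(lengths, start=1):
        if ell >= i:
            d = i  # i transactions so far all have length >= i
        else:
            break
    return d

def absolute_sample_size(d, eps, delta, c=0.5):
    """Sample size sufficient for an absolute (eps, delta)-approximation
    of the frequent itemsets, following the O((1/eps^2)(d + log(1/delta)))
    bound from the abstract; the constant c is an assumed placeholder."""
    return math.ceil((c / eps**2) * (d + math.log(1.0 / delta)))

# Toy market-basket dataset (invented for illustration).
dataset = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk"},
    {"bread", "milk", "eggs", "beer"},
]

d = d_index(dataset)        # 2: two transactions of length >= 2 exist,
                            # but not three of length >= 3
n = absolute_sample_size(d, eps=0.05, delta=0.1)
```

Note that the d-index depends only on transaction lengths, not on the items themselves, so it can be maintained in a single scan of the dataset.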