Progressive sampling for association rules based on sampling error estimation

Authors:
Kun-Ta Chuang;Ming-Syan Chen;Wen-Chieh Yang
Affiliations:
Graduate Institute of Communication Engineering, National Taiwan University, Intelligent Business Technology Inc., Taipei, Taiwan, ROC;Graduate Institute of Communication Engineering, National Taiwan University, Intelligent Business Technology Inc., Taipei, Taiwan, ROC;Graduate Institute of Communication Engineering, National Taiwan University, Intelligent Business Technology Inc., Taipei, Taiwan, ROC
Venue:
PAKDD'05 Proceedings of the 9th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
Year:
2005

Citing 5
Cited 5

Efficient progressive sampling

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Modern Information Retrieval

Modern Information Retrieval
Sampling Large Databases for Association Rules

VLDB '96 Proceedings of the 22th International Conference on Very Large Data Bases
Evaluation of sampling for data mining of association rules

RIDE '97 Proceedings of the 7th International Workshop on Research Issues in Data Engineering (RIDE '97) High Performance Database Management for Large-Scale Applications
Efficient Progressive Sampling for Association Rules

ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining

Power-law relationship and self-similarity in the itemset support distribution: analysis and applications

The VLDB Journal — The International Journal on Very Large Data Bases
Feature-preserved sampling over streaming data

ACM Transactions on Knowledge Discovery from Data (TKDD)
Which Is Better for Frequent Pattern Mining: Approximate Counting or Sampling?

DaWaK '09 Proceedings of the 11th International Conference on Data Warehousing and Knowledge Discovery
Summary queries for frequent itemsets mining

Journal of Systems and Software
Efficient discovery of association rules and frequent itemsets through sampling with tight performance guarantees

ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I

Quantified Score

Hi-index	0.00

Visualization

Abstract

We explore in this paper a progressive sampling algorithm, called Sampling Error Estimation (SEE), which aims to identify an appropriate sample size for mining association rules. SEE has two advantages over previous works in the literature. First, SEE is highly efficient because an appropriate sample size can be determined without the need of executing association rules. Second, the identified sample size of SEE is very accurate, meaning that association rules can be highly efficiently executed on a sample of this size to obtain a sufficiently accurate result. This is attributed to the merit of SEE for being able to significantly reduce the influence of randomness by examining several samples with the same size in one database scan. As validated by experiments on various real data and synthetic data, SEE can achieve very prominent improvement in efficiency and also the resulting accuracy over previous works.