Efficient Progressive Sampling for Association Rules

Authors:
Srinivasan Parthasarathy
Venue:
ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
Year:
2002

Citing 0
Cited 25

A Services Oriented Framework for Next Generation Data Analysis Centers

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 10 - Volume 11
Distributed approximate mining of frequent patterns

Proceedings of the 2005 ACM symposium on Applied computing
Toward Unsupervised Correlation Preserving Discretization

IEEE Transactions on Knowledge and Data Engineering
Design of a next generation sampling service for large scale data analysis applications

Proceedings of the 19th annual international conference on Supercomputing
Multi-scaling sampling: an adaptive sampling method for discovering approximate association rules

Journal of Computer Science and Technology
Approximate mining of frequent patterns on streams

Intelligent Data Analysis - Knowledge Discovery from Data Streams
Power-law relationship and self-similarity in the itemset support distribution: analysis and applications

The VLDB Journal — The International Journal on Very Large Data Bases
Feature-preserved sampling over streaming data

ACM Transactions on Knowledge Discovery from Data (TKDD)
A new sampling technique for association rule mining

Journal of Information Science
Efficient Frequent Itemsets Mining by Sampling

Proceedings of the 2006 conference on Advances in Intelligent IT: Active Media Technology 2006
A test paradigm for detecting changes in transactional data streams

DASFAA'08 Proceedings of the 13th international conference on Database systems for advanced applications
Mining top-K frequent itemsets through progressive sampling

Data Mining and Knowledge Discovery
I/O conscious algorithm design and systems support for data analysis on emerging architectures

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Design and analysis of a multi-dimensional data sampling service for large scale data analysis applications

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Discovery of frequent patterns in transactional data streams

Transactions on large-scale data- and knowledge-centered systems II
Locality sensitive hashing for sampling-based algorithms in association rule mining

Expert Systems with Applications: An International Journal
Boolean property encoding for local set pattern discovery: an application to gene expression data analysis

LPD'04 Proceedings of the 2004 international conference on Local Pattern Detection
Sampling ensembles for frequent patterns

FSKD'05 Proceedings of the Second international conference on Fuzzy Systems and Knowledge Discovery - Volume Part I
Progressive sampling for association rules based on sampling error estimation

PAKDD'05 Proceedings of the 9th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
Effective sampling for mining association rules

AI'04 Proceedings of the 17th Australian joint conference on Advances in Artificial Intelligence
An approach for solving very large scale instances of the design distribution problem for distributed database systems

ISSADS'05 Proceedings of the 5th international conference on Advanced Distributed Systems
Stratified k-means clustering over a deep web data source

Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce

Proceedings of the 21st ACM international conference on Information and knowledge management
Efficient discovery of association rules and frequent itemsets through sampling with tight performance guarantees

ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I

Abstract

In data mining, sampling has often been suggested as an effective tool to reduce the size of the dataset operated at some cost to accuracy. However, this loss to accuracy is often difficult to measure and characterize since the exact nature of the learning curve (accuracy vs. sample size) is parameter and data dependent, i.e., we do not know a priori what sample size is needed to achieve a desired accuracy on a particular dataset for a particular set of parameters. In this article we propose the use of progressive sampling to determine the required sample size for association rule mining. We first show that a naive application of progressive sampling is not very efficient for association rule mining. We then present a refinement based on equivalence classes, that seems to work extremely well in practice and is able to converge to the desired sample size very quickly and very accurately. An additional novelty of our approach is the definition of a support-sensitive, interactive measure of accuracy across progressive samples.
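
To make the idea of progressive sampling concrete, here is a minimal Python sketch of the naive variant the abstract mentions: mine frequent itemsets on growing samples and stop once two consecutive samples roughly agree. It is not the paper's equivalence-class refinement, and it does not use the paper's support-sensitive accuracy measure. The geometric schedule, the Jaccard similarity used as a stand-in for accuracy, the brute-force miner, and every function and parameter name (`frequent_itemsets`, `progressive_sample_size`, `start`, `growth`, `target`) are assumptions made for illustration only.

```python
import random
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force miner (illustrative assumption, not the paper's method).

    Returns the itemsets of up to `max_size` items whose relative support
    in `transactions` is at least `min_support`.
    """
    counts = {}
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[combo] = counts.get(combo, 0) + 1
    threshold = min_support * len(transactions)
    return {itemset for itemset, c in counts.items() if c >= threshold}


def jaccard(a, b):
    """Jaccard similarity of two sets; a stand-in accuracy measure."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def progressive_sample_size(transactions, min_support, start=100, growth=2,
                            target=0.95, seed=0):
    """Naive progressive sampling with a geometric schedule (assumed).

    Samples are nested prefixes of one random shuffle. The loop stops at
    the first size whose frequent itemsets agree with those of the next,
    larger sample to at least `target` similarity.
    """
    rng = random.Random(seed)
    shuffled = list(transactions)
    rng.shuffle(shuffled)

    n = min(start, len(shuffled))
    previous = frequent_itemsets(shuffled[:n], min_support)
    while n < len(shuffled):
        next_n = min(n * growth, len(shuffled))
        current = frequent_itemsets(shuffled[:next_n], min_support)
        if jaccard(previous, current) >= target:
            return n  # the larger sample changed little, so n suffices
        n, previous = next_n, current
    return n  # no convergence before the full dataset was reached
```

Each step in this sketch re-mines the whole sample from scratch, which is one reason a naive scheme can be costly; how the paper's refinement avoids that cost is not shown here.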