A new two-phase sampling based algorithm for discovering association rules

Authors:
Bin Chen;Peter Haas;Peter Scheuermann
Affiliations:
Exelixis, Inc., S. San Francisco, CA;IBM Almaden Research Ctr, San Jose, CA;Northwestern University, Evanston, IL
Venue:
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2002

Abstract

This paper introduces FAST, a novel two-phase sampling-based algorithm for discovering association rules in large databases. In Phase I, a large initial sample of transactions is collected and used to quickly and accurately estimate the support of each individual item in the database. In Phase II, these estimated supports are used to either trim "outlier" transactions or select "representative" transactions from the initial sample, thereby forming a small final sample that more accurately reflects the statistical characteristics (i.e., itemset supports) of the entire database. The expensive operation of discovering association rules is then performed on the final sample. In an empirical study, FAST achieved 90-95% accuracy using a final sample only 15-33% the size of a comparable random sample. This efficiency gain resulted in a speedup by roughly a factor of 10 over previous algorithms that require expensive processing of the entire database, including even efficient algorithms that exploit sampling. Our new sampling technique can be used in conjunction with almost any standard association-rule algorithm, and can potentially render scalable other algorithms that mine "count" data.
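
The two phases described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract does not say how "outlier" transactions are identified, so the sketch assumes a greedy rule that repeatedly drops the transaction whose removal brings the item supports of the shrinking sample closest (in L1 distance) to the Phase I estimates. The function names, the distance measure, and the parameters `initial_size` and `final_size` are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def item_supports(transactions):
    """Return the fraction of transactions that contain each item."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in set(t))
    return {item: c / n for item, c in counts.items()}


def distance(sample, target):
    """L1 distance between a sample's item supports and the target supports.

    Assumption: the abstract does not name a distance measure; L1 over
    single-item supports is used here only as a simple stand-in.
    """
    s = item_supports(sample) if sample else {}
    items = set(s) | set(target)
    return sum(abs(s.get(i, 0.0) - target.get(i, 0.0)) for i in items)


def two_phase_sample(database, initial_size, final_size, seed=0):
    """Toy two-phase sampler in the spirit of the abstract (trimming variant)."""
    rng = random.Random(seed)

    # Phase I: draw a large simple random sample and estimate item supports.
    initial = rng.sample(database, initial_size)
    target = item_supports(initial)

    # Phase II (assumed greedy trimming): remove, one at a time, the
    # transaction whose removal leaves the sample closest to the estimates.
    final = list(initial)
    while len(final) > final_size:
        best = min(
            range(len(final)),
            key=lambda k: distance(final[:k] + final[k + 1:], target),
        )
        final.pop(best)
    return final, target
```

The greedy loop re-evaluates every candidate removal from scratch, which is quadratic in the initial sample size and only suitable for toy inputs. An association-rule miner would then be run on the returned final sample rather than on the full database.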