A new two-phase sampling based algorithm for discovering association rules

  • Authors:
  • Bin Chen;Peter Haas;Peter Scheuermann

  • Affiliations:
  • Exelixis, Inc., S. San Francisco, CA;IBM Almaden Research Ctr, San Jose, CA;Northwestern University, Evanston, IL

  • Venue:
  • Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
  • Year:
  • 2002

Quantified Score

Hi-index 0.00

Visualization

Abstract

This paper introduces FAST, a novel two-phase sampling-based algorithm for discovering association rules in large databases. In Phase I a large initial sample of transactions is collected and used to quickly and accurately estimate the support of each individual item in the database. In Phase II these estimated supports are used to either trim "outlier" transactions or select "representative" transactions from the initial sample, thereby forming a small final sample that more accurately reflects the statistical characteristics (i.e., itemset supports) of the entire database. The expensive operation of discovering association rules is then performed on the final sample. In an empirical study, FAST was able to achieve 90--95% accuracy using a final sample having a size of only 15--33% of that of a comparable random sample. This efficiency gain resulted in a speedup by roughly a factor of 10 over previous algorithms that require expensive processing of the entire database --- even efficient algorithms that exploit sampling. Our new sampling technique can be used in conjunction with almost any standard association-rule algorithm, and can potentially render scalable other algorithms that mine "count" data.