Parallel Algorithms for Discovery of Association Rules

  • Authors:
  • Mohammed J. Zaki; Srinivasan Parthasarathy; Mitsunori Ogihara; Wei Li

  • Affiliations:
  • Department of Computer Science, University of Rochester, Rochester, NY 14627.; Department of Computer Science, University of Rochester, Rochester, NY 14627.; Department of Computer Science, University of Rochester, Rochester, NY 14627.; Oracle Corporation, 500 Oracle Parkway, M/S 4op9, Redwood Shores, CA 94065. E-mail: weili@us.oracle.com

  • Venue:
  • Data Mining and Knowledge Discovery
  • Year:
  • 1997

Abstract

Discovery of association rules is an important data mining task. Several parallel and sequential algorithms have been proposed in the literature to solve this problem. Almost all of these algorithms make repeated passes over the database to determine the set of frequent itemsets (an itemset being a subset of database items), thus incurring high I/O overhead. In the parallel case, most algorithms perform a sum-reduction at the end of each pass to construct the global counts, also incurring high synchronization cost.

In this paper we describe new parallel association mining algorithms. The algorithms use novel itemset clustering techniques to approximate the set of potentially maximal frequent itemsets. Once this set has been identified, the algorithms make use of efficient traversal techniques to generate the frequent itemsets contained in each cluster. We propose two clustering schemes based on equivalence classes and maximal hypergraph cliques, and study two lattice traversal techniques based on bottom-up and hybrid search. We use a vertical database layout to cluster related transactions together. The database is also selectively replicated so that the portion of the database needed for the computation of associations is local to each processor. After the initial set-up phase, the algorithms do not need any further communication or synchronization. The algorithms minimize I/O overheads by scanning the local database portion only twice: once in the set-up phase, and once when processing the itemset clusters. Unlike previous parallel approaches, the algorithms use simple intersection operations to compute frequent itemsets and do not have to maintain or search complex hash structures.

Our experimental testbed is a 32-processor DEC Alpha cluster interconnected by the Memory Channel network. We present results on the performance of our algorithms on various databases, and compare it against a well-known parallel algorithm. The best new algorithm outperforms it by an order of magnitude.
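
To make the vertical layout and intersection-based counting concrete, the following is a minimal sequential sketch, not the authors' implementation. It assumes a tiny in-memory database, uses prefix-based equivalence classes with a bottom-up search, and omits the clique-based clustering, hybrid search, and all parallel partitioning and replication. The function names (`to_vertical`, `bottom_up`, `frequent_itemsets`) and the absolute `min_support` count are hypothetical choices made for illustration.

```python
from collections import defaultdict


def to_vertical(transactions):
    """Vertical layout: map each item to the set of transaction ids containing it."""
    tidlists = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tidlists[item].add(tid)
    return dict(tidlists)


def bottom_up(prefix_class, min_support, out):
    """Extend one equivalence class (itemsets sharing a common prefix).

    prefix_class is a list of (itemset_tuple, tidset) pairs in sorted order.
    Support of a candidate is the size of the intersection of two tid-sets,
    so no hash structure is maintained or searched.
    """
    for i, (itemset_a, tids_a) in enumerate(prefix_class):
        out[itemset_a] = len(tids_a)
        next_class = []
        for itemset_b, tids_b in prefix_class[i + 1:]:
            tids = tids_a & tids_b  # simple intersection gives the support
            if len(tids) >= min_support:
                next_class.append((itemset_a + (itemset_b[-1],), tids))
        if next_class:
            bottom_up(next_class, min_support, out)


def frequent_itemsets(transactions, min_support):
    """Return {itemset_tuple: support} for all itemsets meeting min_support."""
    vertical = to_vertical(transactions)
    root = sorted(
        (((item,), tids) for item, tids in vertical.items()
         if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    out = {}
    bottom_up(root, min_support, out)
    return out


if __name__ == "__main__":
    db = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C"], ["A", "B", "C"]]
    for itemset, support in sorted(frequent_itemsets(db, 3).items()):
        print(itemset, support)
```

In this sketch each class is extended using only the tid-sets of its own members, which illustrates why classes can be processed independently once the relevant tid-lists are available locally.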