A sampling-based framework for parallel data mining

  • Authors:
  • Shengnan Cong;Jiawei Han;Jay Hoeflinger;David Padua

  • Affiliations:
  • University of Illinois, Urbana, IL;University of Illinois, Urbana, IL;Intel Americas, Inc., Champaign, IL;University of Illinois, Urbana, IL

  • Venue:
  • Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
  • Year:
  • 2005

Quantified Score

Hi-index 0.00

Visualization

Abstract

The goal of data mining algorithm is to discover useful information embedded in large databases. Frequent itemset mining and sequential pattern mining are two important data mining problems with broad applications. Perhaps the most efficient way to solve these problems sequentially is to apply a pattern-growth algorithm, which is a divide-and-conquer algorithm [9, 10]. In this paper, we present a framework for parallel mining frequent itemsets and sequential patterns based on the divide-and-conquer strategy of pattern growth. Then, we discuss the load balancing problem and introduce a sampling technique, called selective sampling, to address this problem. We implemented parallel versions of both frequent itemsets and sequential pattern mining algorithms following our framework. The experimental results show that our parallel algorithms usually achieve excellent speedups.