Parallel mining of top-k frequent itemsets in very large text database

  • Authors:
  • Yongheng Wang;Yan Jia;Shuqiang Yang

  • Affiliations:
  • Institute of Network, Computer School, National University of Defense Technology, Changsha, China;Institute of Network, Computer School, National University of Defense Technology, Changsha, China;Institute of Network, Computer School, National University of Defense Technology, Changsha, China

  • Venue:
  • WAIM'05 Proceedings of the 6th international conference on Advances in Web-Age Information Management
  • Year:
  • 2005

Abstract

Frequent itemset mining is a common and useful task in data mining, but most current mining algorithms cannot be applied to very large text databases. In this paper, we propose parTFI, a novel and efficient parallel algorithm that finds the top-k frequent itemsets of a specified minimum length in very large text databases. Based on a simple data structure, H-struct, parTFI uses a novel logical vertical data partitioning technique to mine top-k frequent itemsets at each mining server in parallel. Our performance study shows that, when processing very large sparse text databases, parTFI outperforms Apriori and FP-growth, two efficient frequent itemset mining algorithms, even when both run with a better-tuned min_support. Furthermore, by creating the H-struct dynamically, parTFI can handle huge datasets that most other algorithms cannot process.
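
To make the task concrete, the following is a minimal brute-force sketch of what "top-k frequent itemsets with a specified minimum length" means. It is not parTFI: it uses neither H-struct nor vertical partitioning, runs on a single machine, and is only practical for tiny inputs. The function name `top_k_frequent_itemsets` and its parameters (`k`, `min_len`, `max_len`) are hypothetical names chosen for illustration, and treating each document as a set of words is an assumption.

```python
from collections import Counter
from itertools import combinations


def top_k_frequent_itemsets(transactions, k, min_len, max_len=None):
    """Return the k most frequent itemsets with at least min_len items.

    Brute-force illustration of the problem statement only (hypothetical
    helper, not the parTFI algorithm). Each transaction is an iterable of
    items, e.g. the words of one document.
    """
    counts = Counter()
    for transaction in transactions:
        # Deduplicate and sort so each itemset has one canonical tuple form.
        items = sorted(set(transaction))
        upper = len(items) if max_len is None else min(max_len, len(items))
        for size in range(min_len, upper + 1):
            # Count every subset of this size once per transaction.
            counts.update(combinations(items, size))
    # Rank by descending support; break ties by itemset for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


if __name__ == "__main__":
    docs = [
        ["data", "mining", "text"],
        ["data", "mining"],
        ["data", "text"],
        ["mining", "text", "data"],
    ]
    for itemset, support in top_k_frequent_itemsets(docs, k=2, min_len=2):
        print(itemset, support)
```

Note that the caller supplies `k` and `min_len` rather than a support threshold, which is the point of the top-k formulation: no min_support has to be tuned in advance.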