Efficiently mining frequent itemsets from very large databases

  • Authors:
  • Gosta Grahne;Jianfei Zhu

  • Affiliations:
  • Concordia University (Canada);Concordia University (Canada)

  • Venue:
  • Efficiently mining frequent itemsets from very large databases
  • Year:
  • 2004

Quantified Score

Hi-index 0.00

Visualization

Abstract

Efficient algorithms for mining frequent itemsets are crucial for mining association rules and for other data mining tasks. Methods for mining frequent itemsets and for iceberg data cube computation have been implemented using a prefix-tree structure, known as a FP-tree, for storing compressed frequency information. Numerous experimental results have demonstrated that these algorithms perform extremely well. In this thesis we present a novel FP-array technique that greatly reduces the need to traverse FP-trees, thus obtaining significantly improved performance for FP-tree based algorithms. The technique works especially well for sparse datasets. We then present new algorithms for mining all frequent itemsets, maximal frequent itemsets, and closed frequent item-sets. The algorithms use the FP-tree data structure in combination with the FP-array technique efficiently, and incorporate various optimization techniques. In the algorithm for mining maximal frequent itemsets, a variant FP-tree data structure, called a MFI-tree, and an efficient maximality-checking approach are used. Another variant FP-tree data structure, called a CFI-tree, and an efficient closedness-testing approach are also given in the algorithm for mining closed frequent itemsets. Experimental results show that our methods outperform the existing methods in not only the speed of the algorithms, but also their memory consumption and their scalability. We also notice that most algorithms for mining frequent itemsets assume that the main memory is large enough for the data structures used in the mining, and very few efficient algorithms deal with the cases when the database is very large or the minimum support is very low. We thus investigate approaches to mining frequent itemsets when data structures are too large to fit in main memory. Several divide-and-conquer algorithms are presented for mining from disks. Many novel techniques are introduced. Experimental results show that the techniques reduce the required disk accesses by orders of magnitude, and enable truly scalable data mining.