New techniques for efficiently discovering frequent patterns

  • Authors:
  • Gagan Agrawal;Ruoming Jin

  • Affiliations:
  • The Ohio State University;The Ohio State University

  • Venue:
  • New techniques for efficiently discovering frequent patterns
  • Year:
  • 2005

Quantified Score

Hi-index 0.00

Visualization

Abstract

Because of its theoretical and practical importance, the field of frequent pattern mining has been and remain to be one of the most active research area in KDD. In this dissertation, we study three different problems in frequent pattern mining, mining multiple datasets, mining streaming data, and mining large-scale structures from graph datasets. Our study has not only extended the breadth of frequent pattern mining, but also brought new techniques and algorithms into this field. Specifically, our contributions are as follows. (1) Mining multiple datasets. We develop a systematic approach to generate efficient query plans for a single mining query across multiple datasets. We also propose methods to simultaneously optimize multiple such queries and utilize the past mining results in a query-intensive KDD environment. Our experimental results have shown a speedup up to two-order of magnitude comparing with the naive methods without these optimizations. (2) Mining frequent itemsets over streaming data. We propose a new algorithm StreamMining to discover the frequent itemsets over streaming data. In a single pass, StreamMining will guarantee to find a superset of frequent itemsets, but false positive may occur. If the second pass is allowed, StreamMining will be able to remove the false positive and find the exact frequent itemsets. Our detailed evaluation using both synthetic and real datasets has shown our one-pass algorithm is very accurate in practice, and is also very memory efficient. (3)  Mining frequent large-scale structures from graph datasets. We develop a new framework to discover the frequent large-scale structures from graph datasets. This framework is derived from a mathematical concept, topological minor. In this framework, we propose a new algorithm TSMiner, which efficiently enumerates all the frequent large-scale structures in a graph dataset, and a new approach called relabeling function to perform constraint mining. We apply our framework to protein structure data and discover meaningful topological structures. Finally, we demonstrate the viability and scalability of the proposed algorithms on both real and synthetic datasets.