Scalable Data Mining for Rules

  • Authors:
  • M. J. Zaki

  • Venue:
  • Scalable Data Mining for Rules
  • Year:
  • 1998

Abstract

Data mining is the process of automatic extraction of novel, useful, and understandable patterns in very large databases. High-performance scalable and parallel computing is crucial for ensuring system scalability and interactivity as datasets grow inexorably in size and complexity. This thesis deals with both the algorithmic and systems aspects of scalable and parallel data mining algorithms applied to massive databases. The algorithmic aspects focus on the design of efficient, scalable, disk-based parallel algorithms for three key rule discovery techniques: association rules, sequence discovery, and decision tree classification. The systems aspects deal with the scalable implementation of these methods on both sequential machines and popular parallel hardware ranging from shared-memory systems (SMP) to hybrid hierarchical clusters of networked SMP workstations.

The association and sequence mining algorithms use lattice-theoretic combinatorial properties to decompose the original problem into small independent sub-problems that can be solved in main memory. Using efficient search techniques and simple intersection operations, all frequent patterns are enumerated in a few database scans. The parallel algorithms are asynchronous, requiring no communication or synchronization after an initial set-up phase. Furthermore, the algorithms are based on a hierarchical parallelization, utilizing both shared-memory and message-passing primitives. In classification rule mining, we present disk-based parallel algorithms on shared-memory multiprocessors, the first such study. Extensive experiments have been conducted for all three problems, showing immense improvement over previous approaches, with linear scalability in database size.
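The decomposition and intersection steps described in the abstract can be illustrated with a small sketch. The abstract does not give implementation details, so everything below is an assumption made for illustration: a vertical layout that maps each item to the set of transaction ids containing it, and prefix-based classes as the independent sub-problems. The function names (`vertical_layout`, `mine_class`, `frequent_itemsets`) are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict


def vertical_layout(transactions):
    """Map each item to the set of transaction ids (tids) that contain it."""
    tidsets = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets[item].add(tid)
    return tidsets


def mine_class(members, min_support, out):
    """Enumerate frequent itemsets inside one prefix class.

    `members` is a list of (itemset, tidset) pairs that share a common
    prefix. Only these tidsets are needed, so each class is a
    self-contained sub-problem (an assumption of this sketch).
    """
    for i, (itemset_a, tids_a) in enumerate(members):
        out[itemset_a] = len(tids_a)
        next_class = []
        for itemset_b, tids_b in members[i + 1:]:
            # Support of the joined itemset is the size of the intersection.
            tids = tids_a & tids_b
            if len(tids) >= min_support:
                next_class.append((itemset_a + (itemset_b[-1],), tids))
        if next_class:
            mine_class(next_class, min_support, out)


def frequent_itemsets(transactions, min_support):
    """Return {itemset tuple: support count} for all frequent itemsets."""
    tidsets = vertical_layout(transactions)
    roots = sorted(
        (((item,), tids) for item, tids in tidsets.items()
         if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    out = {}
    mine_class(roots, min_support, out)
    return out


if __name__ == "__main__":
    # Toy data, invented for this example.
    toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    for itemset, support in sorted(frequent_itemsets(toy, 3).items()):
        print(itemset, support)
```

Because each recursive call touches only the tid sets of its own class, separate classes could in principle be handed to separate processors after the initial layout is built, which is one way to read the abstract's claim that no communication is needed after set-up. The sketch runs sequentially and in memory; it does not model the disk-based or parallel aspects of the thesis.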