Parallel Mining of Outliers in Large Database

  • Authors:
  • Edward Hung; David W. Cheung

  • Affiliations:
  • Department of Computer Science and Information Systems, The University of Hong Kong, Hong Kong, People's Republic of China. ehung@cs.umd.edu; Department of Computer Science and Information Systems, The University of Hong Kong, Hong Kong, People's Republic of China. dcheung@csis.hku.hk

  • Venue:
  • Distributed and Parallel Databases
  • Year:
  • 2002

Abstract

Data mining is a new, important, and fast-growing database application. Outlier (exception) detection is one kind of data mining, with applications in areas such as monitoring credit card fraud and criminal activity in electronic commerce. As databases grow in both size and number of attributes (dimensions), previously proposed detection methods designed for two dimensions are no longer applicable. The time complexity of the Nested-Loop (NL) algorithm (Knorr and Ng, in Proc. 24th VLDB, 1998) is linear in the dimensionality but quadratic in the dataset size, which incurs an unacceptable cost for large datasets. A more efficient version (ENL) and its parallel version (PENL) are introduced. In theory, the performance improvement of PENL is linear in the number of processors, as shown by a performance comparison between ENL and PENL under the Bulk Synchronous Parallel (BSP) model. This improvement is further verified by experiments on an IBM 9076 SP2 parallel computer system. The results show that PENL is a very good choice for mining outliers on a cluster of low-cost workstations interconnected by a commodity communication network.
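
To make the cost argument concrete, here is a minimal sketch of a naive nested-loop detector. It is not the NL, ENL, or PENL algorithm from the paper: the abstract does not spell out the outlier definition, so the sketch assumes the distance-based DB(p, D) notion associated with Knorr and Ng, under which a point is an outlier when at least a fraction p of the dataset lies farther than distance D from it. The function name `db_outliers`, the choice to exclude the point itself from its own neighbor count, and the use of Euclidean distance are illustrative assumptions.

```python
import math


def db_outliers(points, p, d):
    """Return indices of assumed DB(p, d)-outliers via a naive nested loop.

    Hypothetical helper for illustration only; not the paper's NL/ENL/PENL code.
    """
    n = len(points)
    # Assumption: an outlier may have at most floor(n * (1 - p)) neighbors
    # within distance d, not counting the point itself.
    max_neighbors = math.floor(n * (1 - p))
    outliers = []
    for i, a in enumerate(points):
        neighbors = 0
        for j, b in enumerate(points):
            if i == j:
                continue
            # Each distance costs time proportional to the dimensionality.
            if math.dist(a, b) <= d:
                neighbors += 1
                if neighbors > max_neighbors:
                    # Too many close neighbors: cannot be an outlier, stop early.
                    break
        else:
            # Inner loop finished without exceeding the neighbor limit.
            outliers.append(i)
    return outliers


if __name__ == "__main__":
    sample = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    print(db_outliers(sample, p=0.75, d=2.0))
```

The two nested loops over the dataset give the quadratic dependence on dataset size, and each distance computation touches every coordinate once, which gives the linear dependence on dimensionality. Because the outer-loop iterations are independent of one another, they could in principle be divided among processors, which is the kind of partitioning a parallel version would exploit.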