A distributed algorithm for outlier detection in a large database

  • Authors:
  • Biplab Kumer Sarker;Hiroyuki Kitagawa

  • Affiliations:
  • Graduate School of Systems and Information Engineering;Graduate School of Systems and Information Engineering

  • Venue:
  • DNIS'05 Proceedings of the 4th international conference on Databases in Networked Information Systems
  • Year:
  • 2005

Abstract

This paper proposes a distributed algorithm for detecting outliers in large, distributed datasets. The algorithm uses a distance-based definition of outliers: points are ranked by the distance to their kth nearest neighbor, and the top n points in the ranking are declared outliers. To the best of our knowledge, this is the first proposal of a distributed outlier detection algorithm for shared-nothing multiprocessor computing environments. The algorithm has four phases. First, in each processing node, it partitions the input dataset into disjoint subsets and prunes entire partitions as soon as it determines that they cannot contain outliers. Second, it applies a global filtering technique to collect global candidate partitions from the local candidate partitions in each processing node. Third, it introduces a load balancing algorithm to balance the number of local candidate partitions. Finally, it identifies the outliers in each processing node.
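
The outlier definition used in the abstract (rank points by the distance to their kth nearest neighbor and report the top n) can be illustrated with a minimal single-node sketch. This is not the paper's distributed algorithm: it is a brute-force version without partitioning, pruning, global filtering, or load balancing. It assumes Euclidean distance, and the function names `kth_nn_distance` and `top_n_outliers` and the sample data are hypothetical.

```python
import heapq
import math


def kth_nn_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor, excluding itself."""
    # Brute force: compute the distance to every other point (assumed Euclidean).
    dists = [math.dist(points[i], q) for j, q in enumerate(points) if j != i]
    # The k smallest distances come back in ascending order; the last is the k-th.
    return heapq.nsmallest(k, dists)[-1]


def top_n_outliers(points, k, n):
    """Return the indices of the n points with the largest k-th NN distance."""
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < len(points)")
    scores = [(kth_nn_distance(points, i, k), i) for i in range(len(points))]
    # Rank by score, highest first, and keep only the indices.
    return [i for _, i in heapq.nlargest(n, scores)]


# Hypothetical data: four points clustered near the origin and one far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(top_n_outliers(points, k=2, n=1))  # prints [4]
```

The brute-force scan costs a quadratic number of distance computations, which is the cost that partition pruning, as described in the abstract, is meant to reduce on large datasets.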