A distributed algorithm for outlier detection in a large database

Authors:
Biplab Kumer Sarker;Hiroyuki Kitagawa
Affiliations:
Graduate School of Systems and Information Engineering;Graduate School of Systems and Information Engineering
Venue:
DNIS'05 Proceedings of the 4th international conference on Databases in Networked Information Systems
Year:
2005

Abstract

This paper proposes a distributed algorithm for detecting outliers in large, distributed datasets. The algorithm uses the distance-based notion of outliers: points are ranked by the distance to their kth nearest neighbor, and the top n points in the ranking are declared outliers. To the best of our knowledge, this is the first distributed outlier detection algorithm proposed for shared-nothing multiprocessor computing environments. The algorithm has four phases. First, each processing node partitions its input data set into disjoint subsets and prunes entire partitions as soon as it determines that they cannot contain outliers. Second, a global filtering technique collects global candidate partitions from the local candidate partitions in each processing node. Third, a load balancing algorithm balances the number of local candidate partitions across the processing nodes. Finally, the outliers are identified at each processing node.
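
The outlier definition used in the abstract (rank points by the distance to their kth nearest neighbor and report the top n) can be illustrated with a minimal single-node sketch. This is a brute-force baseline for the definition only, not the paper's distributed algorithm: it does no partitioning, pruning, global filtering, or load balancing. The function names, the use of Euclidean distance, and the sample points are assumptions made for illustration.

```python
import heapq
import math


def kth_nn_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor, excluding itself.

    Assumption: Euclidean distance; the abstract does not name a metric.
    """
    dists = [math.dist(points[i], q) for j, q in enumerate(points) if j != i]
    # The k smallest distances, in ascending order; the last one is the k-th.
    return heapq.nsmallest(k, dists)[-1]


def top_n_outliers(points, k, n):
    """Indices of the n points with the largest k-th nearest-neighbor distance."""
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < len(points)")
    scores = [(kth_nn_distance(points, i, k), i) for i in range(len(points))]
    # Rank by score, largest first, and keep the top n indices.
    return [i for _, i in heapq.nlargest(n, scores)]


# Hypothetical data: four tightly clustered points and one distant point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(top_n_outliers(points, k=2, n=1))  # prints [4]
```

The brute-force scan costs quadratic time in the number of points, which is the cost that partition-level pruning of the kind described in the abstract is meant to avoid.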