Cell-based outlier detection algorithm: a fast outlier detection algorithm for large datasets

Authors:
You Wan;Fuling Bian
Affiliations:
Research Center of Spatial Information and Digital Engineering, International School of Software, Wuhan University, Wuhan, China;Research Center of Spatial Information and Digital Engineering, International School of Software, Wuhan University, Wuhan, China
Venue:
PAKDD'08 Proceedings of the 12th Pacific-Asia conference on Advances in knowledge discovery and data mining
Year:
2008

Citing 8
Cited 1

LOF: identifying density-based local outliers

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Efficient algorithms for mining outliers from large data sets

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Mining top-n local outliers in large databases

Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
Fast Outlier Detection in High Dimensional Spaces

PKDD '02 Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery
Algorithms for Mining Distance-Based Outliers in Large Datasets

VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Enhancing Effectiveness of Outlier Detections for Low Density Patterns

PAKDD '02 Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Discovering cluster-based local outliers

Pattern Recognition Letters
Mining distance-based outliers in near linear time with randomization and a simple pruning rule

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining

Enhancing effectiveness of density-based outlier mining scheme with density-similarity-neighbor-based outlier factor

Expert Systems with Applications: An International Journal

Quantified Score

Hi-index	0.00

Visualization

Abstract

Finding outliers is an important task for many KDD applications. We developed a cell-based outlier detection algorithm (short for CEBOD) to detect outliers in large dataset. The algorithm is based on LOF; major difference is CEBOD can avoid large computations on the majority part of dataset by filter the initial dataset. Our experiment shows that CEBOD is more efferent than LOF, and can find outliers in large datasets fast and accurately. A large dataset is loaded into memory by blocks, and the data are placed into appropriate cells based on their values. Each cell holds a certain number of data, which represents the cell's density. Data locate in high density cells and have no nearness relationship with local outlier factor calculation are filtered. And we record these cells' density for the next block of data fill in. The final calculation will be done on those data in low density cells. In this way, we can handle a large dataset which can't be loaded into memory once, improving the algorithm's efficiency by reducing many useless computations. The time complexity of CEBOD is O(N).