A fast outlier detection strategy for distributed high-dimensional data sets with mixed attributes

Authors:
Anna Koufakou; Michael Georgiopoulos
Affiliations:
U.A. Whitaker School of Engineering, Florida Gulf Coast University, Fort Myers, USA 33965 and School of EECS, University of Central Florida, Orlando, USA 32816; School of EECS, University of Central Florida, Orlando, USA 32816
Venue:
Data Mining and Knowledge Discovery
Year:
2010

Abstract

Outlier detection has attracted substantial attention in many applications and research areas; two of the most prominent applications are network intrusion detection and credit card fraud detection. Many of the existing approaches are based on calculating distances among the points in the dataset. These approaches cannot easily adapt to current datasets, which usually contain a mix of categorical and continuous attributes and may be distributed among different geographical locations. In addition, current datasets usually have a large number of dimensions. Such datasets tend to be sparse, and traditional concepts such as Euclidean distance or nearest neighbor become unsuitable. We propose a fast distributed outlier detection strategy intended for datasets containing mixed attributes. The proposed method takes into consideration the sparseness of the dataset, and is experimentally shown to be highly scalable with the number of points and the number of attributes in the dataset. Experimental results show that the proposed outlier detection method compares very favorably with other state-of-the-art outlier detection strategies proposed in the literature, and that the speedup achieved by its distributed version is very close to linear.