Outlier Mining in Large High-Dimensional Data Sets

  • Authors:
  • Fabrizio Angiulli; Clara Pizzuti

  • Venue:
  • IEEE Transactions on Knowledge and Data Engineering
  • Year:
  • 2005

Abstract

In this paper, we propose a new definition of distance-based outlier and an algorithm, called HilOut, designed to efficiently detect the top n outliers of a large, high-dimensional data set. Given an integer k, the weight of a point is defined as the sum of the distances separating it from its k nearest neighbors. Outliers are those points scoring the largest values of weight. The HilOut algorithm makes use of the notion of space-filling curve to linearize the data set, and it consists of two phases. The first phase provides an approximate solution, within a rough factor, after the execution of at most d + 1 sorts and scans of the data set, with temporal cost quadratic in d and linear in N and in k, where d is the number of dimensions of the data set and N is the number of points in the data set. During this phase, the algorithm isolates candidate outlier points and reduces this set at each iteration. If the size of this set becomes n, then the algorithm stops, reporting the exact solution. The second phase calculates the exact solution with a final scan that further examines the candidate outliers remaining after the first phase. Experimental results show that the algorithm always stops, reporting the exact solution, during the first phase after far fewer than d + 1 steps. We present both an in-memory and a disk-based implementation of the HilOut algorithm, together with a thorough scaling analysis on real and synthetic data sets showing that the algorithm scales well in both cases.
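
The weight-based outlier definition summarized above can be illustrated with a short brute-force sketch. This is not the HilOut algorithm itself (which relies on space-filling-curve linearization and a two-phase scan to avoid quadratic pairwise work); the function name, parameters, and example data below are illustrative assumptions only.

```python
import numpy as np

def top_n_outliers_bruteforce(X, k, n):
    """Naive illustration of the distance-based outlier definition:
    the weight of a point is the sum of the distances to its k nearest
    neighbors, and the top-n outliers are the n points with the largest
    weights. This O(N^2 * d) sketch ignores HilOut's space-filling-curve
    approximation entirely."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances (N x N matrix).
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    # Weight of each point = sum of its k smallest distances.
    knn = np.partition(dists, k - 1, axis=1)[:, :k]
    weights = knn.sum(axis=1)
    # Indices of the n points with the largest weights, descending.
    top = np.argsort(-weights)[:n]
    return top, weights[top]

# Example: a tight Gaussian cluster plus a few far-away points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 10)),
               rng.normal(8, 1, size=(3, 10))])  # 3 obvious outliers
idx, w = top_n_outliers_bruteforce(X, k=5, n=3)
print(idx, w)
```

An implementation following the paper would replace the N x N distance matrix with the two-phase scheme described above: an approximate first phase driven by at most d + 1 sorts and scans along space-filling-curve orderings, followed by a final exact scan over the surviving candidates.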