CoCo: coding cost for parameter-free outlier detection

Authors:
Christian Böhm; Katrin Haegler; Nikola S. Müller; Claudia Plant
Affiliations:
University of Munich, Munich, Germany; University of Munich, Munich, Germany; Max Planck Institute of Biochemistry, Martinsried, Germany; Technische Universität München, Munich, Germany
Venue:
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2009

Abstract

How can we automatically spot all outstanding observations in a data set? This question arises in a wide variety of applications, e.g. in economics, biology and medicine. Existing approaches to outlier detection suffer from one or more of the following drawbacks: The results of many methods depend strongly on parameter settings that are very difficult to estimate without background knowledge of the data, e.g. the minimum cluster size or the number of desired outliers. Many methods implicitly assume Gaussian or uniformly distributed data, and/or their results are difficult to interpret. To cope with these problems, we propose CoCo, a technique for parameter-free outlier detection. The basic idea of our technique relates outlier detection to data compression: Outliers are objects that cannot be effectively compressed given the data set. To avoid assuming a particular data distribution, CoCo relies on a very general data model combining the Exponential Power Distribution with Independent Components. We define an intuitive outlier factor based on the Minimum Description Length principle, together with a novel algorithm for outlier detection. An extensive experimental evaluation on synthetic and real-world data demonstrates the benefits of our technique. Availability: The source code of CoCo and the data sets used in the experiments are available at: http://www.dbs.ifi.lmu.de/Forschung/KDD/Boehm/CoCo.
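
The link between compression and outlierness can be illustrated with a minimal sketch. This is not the CoCo algorithm: it omits Independent Components, fits a single one-dimensional Exponential Power Distribution to all values, and keeps the shape parameter beta fixed (an assumption made here for brevity). The function names and the sample data are hypothetical. The coding cost of a value is taken as the negative base-2 logarithm of its density, which equals the code length in bits up to a constant that depends on the discretization resolution.

```python
import math
from statistics import median


def epd_log2_density(x, mu, alpha, beta):
    """log2 of the Exponential Power (generalized Gaussian) density at x.

    Density: beta / (2 * alpha * Gamma(1/beta)) * exp(-(|x - mu| / alpha) ** beta)
    """
    log_norm = math.log(beta) - math.log(2.0 * alpha) - math.lgamma(1.0 / beta)
    log_pdf = log_norm - (abs(x - mu) / alpha) ** beta
    return log_pdf / math.log(2.0)


def coding_costs(values, beta=2.0):
    """Coding cost (in bits, up to a resolution constant) of each value.

    Assumption: location is the median, and the scale alpha is the
    maximum-likelihood estimate for the given, fixed beta.
    """
    n = len(values)
    mu = median(values)
    alpha = (beta / n * sum(abs(v - mu) ** beta for v in values)) ** (1.0 / beta)
    return [-epd_log2_density(v, mu, alpha, beta) for v in values]


# Hypothetical sample: six similar measurements and one deviating value.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 17.5]
costs = coding_costs(data)

# The value that is most expensive to encode is the strongest outlier candidate.
worst = max(range(len(data)), key=costs.__getitem__)
print(data[worst], round(costs[worst], 2))
```

Values close to the bulk of the data receive short codes, while the deviating value needs noticeably more bits, so ranking by coding cost requires no threshold on cluster size or outlier count. With beta = 2 the model reduces to a Gaussian; other values of beta give heavier or lighter tails.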