Fast and reliable anomaly detection in categorical data

Authors:
Leman Akoglu;Hanghang Tong;Jilles Vreeken;Christos Faloutsos
Affiliations:
Carnegie Mellon University, Pittsburgh, PA, USA;IBM T. J. Watson, Hawthorne, NY, USA;University of Antwerp, Mathematics and Computer Science, Belgium;Carnegie Mellon University, Pittsburgh, PA, USA
Venue:
Proceedings of the 21st ACM international conference on Information and knowledge management
Year:
2012

Citing 13
Cited 0

An introduction to Kolmogorov complexity and its applications

An introduction to Kolmogorov complexity and its applications
LOF: identifying density-based local outliers

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Outlier detection for high dimensional data

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
A symbolic representation of time series, with implications for streaming algorithms

DMKD '03 Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
Detecting anomalous records in categorical datasets

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Fast mining of distance-based outliers in high-dimensional datasets

Data Mining and Knowledge Discovery
The Discrete Basis Problem

IEEE Transactions on Knowledge and Data Engineering
Finding Good Itemsets by Packing Data

ICDM '08 Proceedings of the 2008 Eighth IEEE International Conference on Data Mining
Krimp: mining itemsets that compress

Data Mining and Knowledge Discovery
OddBall: spotting anomalies in weighted graphs

PAKDD'10 Proceedings of the 14th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining - Volume Part II
Anomaly Detection for Discrete Sequences: A Survey

IEEE Transactions on Knowledge and Data Engineering
Modeling by shortest data description

Automatica (Journal of IFAC)
Summarizing categorical data by clustering attributes

Data Mining and Knowledge Discovery

Abstract

Spotting anomalies in large multi-dimensional databases is a crucial task with many applications in finance, health care, security, etc. We introduce COMPREX, a new approach for identifying anomalies using pattern-based compression. Informally, our method finds a collection of dictionaries that describe the norm of a database succinctly, and subsequently flags as anomalies those points that are dissimilar to the norm, that is, those with high compression cost. Our approach exhibits four key features: 1) it is parameter-free: it builds dictionaries directly from data and requires no user-specified parameters such as distance functions or density and similarity thresholds; 2) it is general: we show it works for a broad range of complex databases, including graph, image and relational databases that may contain both categorical and numerical features; 3) it is scalable: its running time grows linearly with respect to both database size and number of dimensions; and 4) it is effective: experiments on a broad range of datasets show large improvements in both compression and anomaly detection precision over state-of-the-art competitors.
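
The idea of scoring points by compression cost can be illustrated with a minimal sketch. This is not the COMPREX algorithm: it assumes the simplest possible setting, in which every attribute gets its own dictionary and each value is assigned a Shannon-optimal code length of -log2 of its relative frequency. The function names (build_code_tables, compression_cost, rank_by_cost) and the toy data are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def build_code_tables(rows):
    """Build one code table per attribute (an assumed simplification).

    Each table maps a value to its code length in bits, taken as
    -log2 of the value's relative frequency in that attribute.
    """
    n = len(rows)
    n_attrs = len(rows[0])
    tables = []
    for j in range(n_attrs):
        counts = Counter(row[j] for row in rows)
        tables.append({v: -math.log2(c / n) for v, c in counts.items()})
    return tables


def compression_cost(row, tables):
    """Total number of bits needed to encode one row with the tables."""
    return sum(tables[j][value] for j, value in enumerate(row))


def rank_by_cost(rows):
    """Return row indices sorted from highest to lowest cost, plus the costs."""
    tables = build_code_tables(rows)
    costs = [compression_cost(row, tables) for row in rows]
    order = sorted(range(len(rows)), key=lambda i: costs[i], reverse=True)
    return order, costs


# Toy categorical database: seven identical rows and one unusual row.
rows = [("a", "x")] * 7 + [("b", "y")]
order, costs = rank_by_cost(rows)
print(order[0], costs[order[0]])  # the unusual row has the highest cost
```

Rows made of frequent values receive short codes and therefore a low total cost, while the rare row is expensive to encode and ranks first. No distance function or threshold is supplied by the user in this sketch; the ranking follows from the data alone.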