Detecting anomalous records in categorical datasets

Authors:
Kaustav Das;Jeff Schneider
Affiliations:
Carnegie Mellon University;Carnegie Mellon University
Venue:
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2007

Abstract

We consider the problem of detecting anomalies in high-arity categorical datasets. In most applications, anomalies are defined as data points that are "abnormal". Quite often we have access to data that consists mostly of normal records, along with a small percentage of unlabelled anomalous records. We are interested in the problem of unsupervised anomaly detection, where we use the unlabelled data for training and detect records that do not follow the definition of normality. A standard approach is to create a model of normal data and compare test records against it. A probabilistic approach builds a likelihood model from the training data. Records are tested for anomalies based on the complete record likelihood given the probability model. For categorical attributes, Bayes nets give a standard representation of the likelihood. While this approach is good at finding outliers in the dataset, it tends to flag records simply because they contain rare attribute values. Sometimes detecting rare values of an attribute is not desired, and such outliers are not considered anomalies in that context. We present an alternative definition of anomalies, and propose an approach that compares records against the marginal distributions of attribute subsets. We show that this is a more meaningful way of detecting anomalies, and that it performs better on semi-synthetic as well as real-world datasets.
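The contrast between "rare value" and "unusual combination" can be made concrete with a small sketch. The abstract does not give the scoring function, so everything below is an assumption made for illustration: the function names `fit_counts` and `pair_ratio_score`, the restriction to attribute pairs, the crude smoothing constant `alpha`, and the toy records are all hypothetical. The score is the ratio of a pair's joint frequency to the product of its single-attribute frequencies; a ratio well below 1 suggests that two values rarely co-occur even though each may be common on its own.

```python
from collections import Counter
from itertools import combinations


def fit_counts(records):
    """Count single-attribute values and attribute-pair value combinations."""
    n_attrs = len(records[0])
    single = [Counter() for _ in range(n_attrs)]
    pair = {ij: Counter() for ij in combinations(range(n_attrs), 2)}
    for rec in records:
        for i, value in enumerate(rec):
            single[i][value] += 1
        for (i, j) in pair:
            pair[(i, j)][(rec[i], rec[j])] += 1
    return single, pair, len(records)


def pair_ratio_score(rec, single, pair, n, alpha=1.0):
    """Smallest ratio P(a, b) / (P(a) * P(b)) over all attribute pairs.

    Assumption: alpha is added to every count and to n as a crude guard
    against zero counts; these are not properly normalized probabilities.
    """
    lowest = float("inf")
    for (i, j), counts in pair.items():
        p_ab = (counts[(rec[i], rec[j])] + alpha) / (n + alpha)
        p_a = (single[i][rec[i]] + alpha) / (n + alpha)
        p_b = (single[j][rec[j]] + alpha) / (n + alpha)
        lowest = min(lowest, p_ab / (p_a * p_b))
    return lowest


# Hypothetical training data: (operating system, remote-access protocol).
train = [("linux", "ssh")] * 50 + [("windows", "rdp")] * 50 + [("bsd", "ssh")]
single, pair, n = fit_counts(train)

# A record with a rare value that is consistent with what was seen.
rare_value = pair_ratio_score(("bsd", "ssh"), single, pair, n)
# A record whose values are each common but never appear together.
odd_combo = pair_ratio_score(("linux", "rdp"), single, pair, n)

print(f"rare value:      {rare_value:.3f}")
print(f"odd combination: {odd_combo:.3f}")
```

Under these assumptions the record with the rare value `bsd` gets a ratio above 1, while the never-seen pairing of two common values gets a ratio far below 1. A full-record likelihood would penalize the first record for its rarity alone, which is the behaviour the abstract describes as undesirable.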