Detecting outliers using transduction and statistical testing

Authors:
Daniel Barbará;Carlotta Domeniconi;James P. Rogers
Affiliations:
George Mason University, Fairfax, VA;George Mason University, Fairfax, VA;U.S. Army Engineer Research and Development Center, Alexandria, VA
Venue:
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2006

Abstract

Outlier detection can uncover malicious behavior in fields such as intrusion detection and fraud analysis. Although outlier detection has been studied extensively, most algorithms proposed in the literature rely on a particular definition of outliers (e.g., density-based) and use ad hoc thresholds to detect them. In this paper we present a novel technique that detects outliers with respect to an existing clustering model; the same test can also be applied successfully when no clustering information is available. Our method is based on Transductive Confidence Machines, which were previously proposed as a mechanism for attaching individual confidence measures to classification decisions. The test uses hypothesis testing to decide, for each cluster of the model, whether a point fits that cluster. We experimentally demonstrate that the test is highly robust and produces very few misdiagnosed points, even when no clustering information is available. Our experiments further show that the method remains robust when the data are contaminated by outliers. Finally, we show that our technique can identify outliers in a noisy data set for which no side information (e.g., ground truth or clustering structure) is available. The proposed methodology can therefore bootstrap, from a noisy data set, a clean one that can be used to identify future outliers.
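
The per-cluster hypothesis test described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the strangeness measure, the significance level, or any function names, so everything below is an assumption made for illustration: strangeness is taken to be the sum of distances to the k nearest neighbours, the baseline strangeness values are computed once per cluster without re-inserting the test point, and the names `knn_strangeness`, `cluster_p_value`, and `is_outlier` are hypothetical. It is not the authors' implementation.

```python
import math


def knn_strangeness(point, others, k):
    """Assumed strangeness: sum of distances to the k nearest points in `others`."""
    dists = sorted(math.dist(point, o) for o in others)
    return sum(dists[:k])


def cluster_p_value(point, cluster, k=3):
    """Transductive p-value for the hypothesis 'point belongs to cluster'."""
    # Baseline strangeness of each member, measured against the other members.
    baseline = [
        knn_strangeness(m, cluster[:i] + cluster[i + 1:], k)
        for i, m in enumerate(cluster)
    ]
    alpha_new = knn_strangeness(point, cluster, k)
    # Fraction of strangeness values (the new point included) at least as large.
    rank = sum(1 for a in baseline if a >= alpha_new) + 1
    return rank / (len(cluster) + 1)


def is_outlier(point, clusters, significance=0.05, k=3):
    """Flag a point only if membership is rejected for every cluster."""
    return all(cluster_p_value(point, c, k) < significance for c in clusters)


# Two synthetic 5x5 grid clusters (illustrative data, not from the paper).
cluster_a = [(i * 0.1, j * 0.1) for i in range(5) for j in range(5)]
cluster_b = [(5 + i * 0.1, 5 + j * 0.1) for i in range(5) for j in range(5)]

far_point = (2.5, 2.5)     # between the clusters, far from both
near_point = (0.25, 0.25)  # inside cluster_a

print(is_outlier(far_point, [cluster_a, cluster_b]))   # True
print(is_outlier(near_point, [cluster_a, cluster_b]))  # False
```

Because the smallest attainable p-value is 1/(n+1) for a cluster of n points, a cluster needs at least 20 members before any point can be rejected at the 0.05 level in this sketch.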