Active learning and subspace clustering for anomaly detection

Authors:
Karim Pichara; Alvaro Soto
Affiliations:
Vicuña Mackenna 4860, Edificio San Agustín, Macul, Santiago, Chile (corresponding author, e-mail: kpb@ing.puc.cl); Vicuña Mackenna 4860, Edificio San Agustín, Macul, Santiago, Chile
Venue:
Intelligent Data Analysis
Year:
2011



Abstract

Anomaly detection is a highly valuable application in the analysis of today's very large datasets. Insurance companies, banks, and many manufacturing industries need systems that help people detect anomalies in their daily information. Anomalies generally make up a very small fraction of the data, so detecting them is not an easy task. The real source of an anomaly usually lies in specific values on a selected subset of the dataset's dimensions. Furthermore, many anomalies are of no real interest, because the interestingness of an anomaly is judged subjectively by the human user. In this paper we propose a new semi-supervised algorithm that actively learns to detect relevant anomalies by interacting with an expert user to obtain semantic information about user preferences. Our approach is based on three main steps. First, a Bayesian network identifies an initial set of candidate anomalies. Next, a subspace clustering technique identifies relevant subsets of dimensions. Finally, a probabilistic active learning scheme, based on properties of the Dirichlet distribution, uses feedback from an expert user to search efficiently for relevant anomalies. Our results on synthetic and real datasets indicate that, under noisy data and anomalies presenting regular patterns, our approach correctly identifies relevant anomalies.
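
As a rough illustration of the third step only, the following Python sketch shows how a Dirichlet distribution could track expert feedback for each candidate cluster. It is not the authors' algorithm, which the abstract does not specify: the class `DirichletRelevance`, the function `pick_cluster`, the uniform prior, the Thompson-style selection rule, the cluster names, and the simulated expert are all assumptions introduced here for illustration.

```python
import random


class DirichletRelevance:
    """Dirichlet belief over (relevant, irrelevant) for one candidate cluster."""

    def __init__(self, prior=(1.0, 1.0)):
        # Uniform prior pseudo-counts (an assumption, not taken from the paper).
        self.alpha = list(prior)

    def update(self, relevant):
        # Conjugate update: each expert label adds one pseudo-count.
        self.alpha[0 if relevant else 1] += 1.0

    def mean(self):
        # Posterior mean probability that anomalies in this cluster are relevant.
        return self.alpha[0] / sum(self.alpha)

    def sample(self, rng):
        # Draw from the Dirichlet by normalizing independent Gamma draws.
        draws = [rng.gammavariate(a, 1.0) for a in self.alpha]
        return draws[0] / sum(draws)


def pick_cluster(beliefs, rng):
    # Thompson-style choice (assumed rule): query the cluster whose
    # sampled relevance probability is highest.
    return max(beliefs, key=lambda name: beliefs[name].sample(rng))


def run_demo(seed=0, rounds=30):
    rng = random.Random(seed)
    # Hypothetical chance that the expert marks a candidate from each cluster as relevant.
    true_rate = {"cluster_a": 0.9, "cluster_b": 0.2, "cluster_c": 0.05}
    beliefs = {name: DirichletRelevance() for name in true_rate}
    for _ in range(rounds):
        name = pick_cluster(beliefs, rng)
        label = rng.random() < true_rate[name]  # simulated expert feedback
        beliefs[name].update(label)
    return beliefs


if __name__ == "__main__":
    for name, belief in run_demo().items():
        print(name, belief.alpha, round(belief.mean(), 3))
```

Because the Dirichlet is conjugate to the multinomial, each expert label only increments a count, so the belief can be updated after every query at negligible cost. Sampling from the posterior, rather than always querying the cluster with the highest mean, is one simple way to keep occasionally exploring clusters that have received few labels.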