Detection of Anomalies in Large Datasets Using an Active Learning Scheme Based on Dirichlet Distributions

Authors:
Karim Pichara;Alvaro Soto;Anita Araneda
Affiliations:
Pontificia Universidad Católica de Chile;Pontificia Universidad Católica de Chile;Pontificia Universidad Católica de Chile
Venue:
IBERAMIA '08 Proceedings of the 11th Ibero-American conference on AI: Advances in Artificial Intelligence
Year:
2008

Citing 10
Cited 0

Probabilistic reasoning in intelligent systems: networks of plausible inference
Query by committee. COLT '92 Proceedings of the fifth annual workshop on Computational learning theory
A sequential algorithm for training text classifiers. SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
Case-based reasoning: foundational issues, methodological variations, and system approaches. AI Communications
Introduction to Expert Systems
Toward Optimal Active Learning through Sampling Estimation of Error Reduction. ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
A Survey of Outlier Detection Methodologies. Artificial Intelligence Review
Learning Bayesian Networks
An Accelerated Algorithm for Density Estimation in Large Databases Using Gaussian Mixtures. Cybernetics and Systems
Unsupervised Anomaly Detection in Large Databases Using Bayesian Networks. Applied Artificial Intelligence

Abstract

Detecting anomalous records is a highly valuable application in the analysis of today's very large datasets. In this paper we propose a new algorithm that, with the help of a human expert, efficiently explores a dataset with the goal of detecting relevant anomalous records. Under this scheme, the computer selectively asks the expert to label data, seeking relevant semantic feedback to improve its knowledge of what characterizes a relevant anomaly. Our rationale is that while computers can process huge amounts of low-level data, an expert has the high-level semantic knowledge needed to lead the search efficiently. We build upon our previous work based on Bayesian networks, which provides an initial set of potential anomalies. In this paper, we augment this approach with an active learning scheme based on the clustering properties of Dirichlet distributions. We test the performance of our algorithm using synthetic and real datasets. Our results indicate that, when the data are noisy and the anomalies follow regular patterns, our approach significantly reduces the rate of false positives while decreasing the time needed to reach the relevant anomalies.