Today, with the advances in computer storage and processing technology, huge datasets are available, offering an opportunity to extract valuable information. Probabilistic approaches are especially well suited to learning from data, since they represent knowledge as density functions. In this paper, we choose Gaussian mixture models (GMMs) to represent densities, as they are flexible enough to fit a wide class of problems. The classical estimation approach for GMMs is the iterative expectation-maximization (EM) algorithm. This approach, however, does not scale well enough to meet the demanding processing requirements of large databases. In this paper we introduce an EM-based algorithm that solves this scalability problem. Our approach is based on the concept of data condensation which, in addition to substantially reducing the computational load, provides sound starting values that allow the algorithm to converge faster. We also address the model selection problem. We test our algorithm on synthetic and real databases and find several advantages over other standard existing procedures.
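For concreteness, the classical EM procedure that the abstract refers to can be sketched as follows. This is a minimal, illustrative one-dimensional implementation of plain EM for a GMM — the baseline whose per-iteration cost grows with the full dataset size — not the condensation-based algorithm proposed in the paper; the function name `em_gmm` and all parameter choices are ours.

```python
import numpy as np

def em_gmm(X, k, n_iter=50, seed=0):
    """Fit a k-component Gaussian mixture to 1-D data X with plain EM.

    Returns (weights, means, variances). Illustrative only: every
    iteration touches all n points, which is exactly the cost that
    data condensation is meant to reduce.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    # Initialize means from randomly chosen data points.
    means = rng.choice(X, size=k, replace=False).astype(float)
    variances = np.full(k, X.var())
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] = P(component j | x_i).
        diff = X[:, None] - means[None, :]
        dens = np.exp(-0.5 * diff**2 / variances) / np.sqrt(2 * np.pi * variances)
        r = weights * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities.
        nk = r.sum(axis=0)
        weights = nk / n
        means = (r * X[:, None]).sum(axis=0) / nk
        variances = (r * (X[:, None] - means)**2).sum(axis=0) / nk
    return weights, means, variances
```

On well-separated synthetic data (e.g. two Gaussian clusters), this recovers the component means and mixing weights; note that both the E-step and the M-step above reduce to weighted sums over the data, which is why the same updates can be run on condensed sufficient statistics instead of individual points.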