In experimental and observational sciences, detecting atypical, peculiar data in large sets of measurements can highlight candidate objects of interesting new types that deserve more detailed, domain-specific follow-up study. However, measurement data are rarely free of measurement errors, and these errors can generate false outliers that are not genuinely interesting. Although many approaches exist for finding outliers, they offer no means of assessing to what extent an apparent peculiarity is simply an artifact of measurement error. To address this issue, we have developed a model-based approach that infers genuine outliers from multivariate data sets when measurement-error information is available. It is based on a probabilistic mixture of hierarchical density models, in which parameter estimation is made feasible by a tree-structured variational expectation-maximization (EM) algorithm. Here, we further develop an algorithmic enhancement that addresses the scalability of this approach, making it applicable to large data sets, via a k-d-tree-based partitioning of the variational posterior assignments. This creates a non-trivial tradeoff between a more detailed noise model, which enhances detection accuracy, and a coarsened posterior representation, which yields computational speedup. We therefore conduct extensive experimental validation to study the accuracy/speed tradeoffs achievable under a variety of data conditions. We find that, at low-to-moderate error levels, a speedup factor at least linear in the number of data points can be achieved without significantly sacrificing detection accuracy. The benefit of including measurement-error information in the model is evident in all situations, and the gain roughly recovers the loss incurred by the speedup procedure under large-error conditions.
We analyze and discuss in detail the characteristics of our algorithm using results from appropriately designed synthetic-data experiments, and we also demonstrate its operation in a real application example.
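The core speedup idea described above — partitioning the data with a k-d tree and letting all points in a tree node share one coarsened posterior assignment — can be illustrated with a minimal sketch. This is an illustrative simplification under stated assumptions, not the paper's algorithm: it uses a plain isotropic Gaussian mixture rather than the hierarchical measurement-error model, and all function names are hypothetical.

```python
import numpy as np

def kdtree_leaves(X, idx=None, max_leaf=32, depth=0):
    """Recursively split point indices into k-d-tree leaves (median splits,
    cycling dimensions). Returns a list of index arrays, one per leaf."""
    if idx is None:
        idx = np.arange(len(X))
    if len(idx) <= max_leaf:
        return [idx]
    d = depth % X.shape[1]
    order = idx[np.argsort(X[idx, d])]
    mid = len(order) // 2
    return (kdtree_leaves(X, order[:mid], max_leaf, depth + 1)
            + kdtree_leaves(X, order[mid:], max_leaf, depth + 1))

def coarse_em_step(X, leaves, means, weights, var):
    """One EM step for an isotropic Gaussian mixture in which every point in
    a leaf shares the responsibility computed at the leaf centroid -- the
    coarsened posterior representation that buys the speedup."""
    N, _ = X.shape
    resp = np.zeros((N, len(means)))
    for leaf in leaves:
        c = X[leaf].mean(axis=0)                     # leaf centroid
        d2 = ((c - means) ** 2).sum(axis=1)          # squared distances, (K,)
        logp = np.log(weights) - 0.5 * d2 / var
        p = np.exp(logp - logp.max())
        resp[leaf] = p / p.sum()                     # shared across the leaf
    Nk = resp.sum(axis=0)
    return (resp.T @ X) / Nk[:, None], Nk / N, resp
```

Because responsibilities are evaluated once per leaf instead of once per point, the E-step cost drops from O(NK) density evaluations to O(LK) for L leaves; shrinking `max_leaf` refines the posterior representation at the cost of speed, which is the accuracy/speed tradeoff studied in the abstract.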