In experimental and observational sciences, detecting atypical, peculiar data in large sets of measurements can highlight candidate objects of interesting new types that deserve more detailed, domain-specific follow-up study. However, measurement data are rarely free of measurement errors, and these errors can generate false outliers that are not genuinely interesting. Although many approaches exist for finding outliers, they offer no means of assessing to what extent an apparent peculiarity is simply an artifact of measurement error. To address this issue, we have developed a model-based approach that infers genuine outliers from multivariate data sets when measurement-error information is available. It is based on a probabilistic mixture of hierarchical density models, in which parameter estimation is made feasible by a tree-structured variational expectation-maximization (EM) algorithm. Here, we further develop an algorithmic enhancement that addresses the scalability of this approach, making it applicable to large data sets, via a k-d-tree-based partitioning of the variational posterior assignments. This creates a non-trivial tradeoff between a more detailed noise model, which enhances detection accuracy, and a coarsened posterior representation, which yields computational speedup. We therefore conduct extensive experimental validation to study the accuracy/speed tradeoffs achievable under a variety of data conditions. We find that, at low-to-moderate error levels, a speedup factor at least linear in the number of data points can be achieved without significantly sacrificing detection accuracy. The benefit of including measurement-error information in the model is evident in all situations, and the gain roughly recovers the loss incurred by the speedup procedure under large-error conditions.
We analyze and discuss in detail the characteristics of our algorithm using results from appropriately designed synthetic-data experiments, and we also demonstrate its operation in a real application example.
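The core speedup idea described above — partitioning the data with a k-d tree and letting all points in a tree node share one coarsened posterior assignment — can be illustrated with a minimal sketch. This is an illustrative simplification under stated assumptions, not the paper's algorithm: it uses a plain isotropic Gaussian mixture rather than the hierarchical measurement-error model, and all function names are hypothetical.

```python
import numpy as np

def kdtree_leaves(X, idx=None, max_leaf=32, depth=0):
    """Recursively split point indices into k-d-tree leaves (median splits,
    cycling dimensions). Returns a list of index arrays, one per leaf."""
    if idx is None:
        idx = np.arange(len(X))
    if len(idx) <= max_leaf:
        return [idx]
    d = depth % X.shape[1]
    order = idx[np.argsort(X[idx, d])]
    mid = len(order) // 2
    return (kdtree_leaves(X, order[:mid], max_leaf, depth + 1)
            + kdtree_leaves(X, order[mid:], max_leaf, depth + 1))

def coarse_em_step(X, leaves, means, weights, var):
    """One EM step for an isotropic Gaussian mixture in which every point in
    a leaf shares the responsibility computed at the leaf centroid -- the
    coarsened posterior representation that buys the speedup."""
    N, _ = X.shape
    resp = np.zeros((N, len(means)))
    for leaf in leaves:
        c = X[leaf].mean(axis=0)                     # leaf centroid
        d2 = ((c - means) ** 2).sum(axis=1)          # squared distances, (K,)
        logp = np.log(weights) - 0.5 * d2 / var
        p = np.exp(logp - logp.max())
        resp[leaf] = p / p.sum()                     # shared across the leaf
    Nk = resp.sum(axis=0)
    return (resp.T @ X) / Nk[:, None], Nk / N, resp
```

Because responsibilities are evaluated once per leaf instead of once per point, the E-step cost drops from O(NK) density evaluations to O(LK) for L leaves; shrinking `max_leaf` refines the posterior representation at the cost of speed, which is the accuracy/speed tradeoff studied in the abstract.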