The anchors hierarchy: using the triangle inequality to survive high dimensional data

Authors:
Andrew W. Moore
Affiliations:
Carnegie Mellon University, Pittsburgh, PA
Venue:
UAI'00 Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence
Year:
2000

References (12):
Bumptrees for efficient function, constraint, and classification learning. NIPS-3 Proceedings of the 1990 conference on Advances in neural information processing systems 3.
Implementing data cubes efficiently. SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data.
Fast discovery of association rules. Advances in knowledge discovery and data mining.
Accelerating exact k-means algorithms with geometric reasoning. KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining.
Very fast EM-based mixture model clustering using multiresolution kd-trees. Proceedings of the 1998 conference on Advances in neural information processing systems 11.
An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Transactions on Mathematical Software (TOMS).
Efficient Locally Weighted Polynomial Regression Predictions. ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning.
X-means: Extending K-means with Efficient Estimation of the Number of Clusters. ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning.
M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases.
Efficient Regular Data Structures and Algorithms for Location and Proximity Problems. FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science.
Cached sufficient statistics for efficient machine learning with large datasets. Journal of Artificial Intelligence Research.
Multiresolution instance-based learning. IJCAI'95 Proceedings of the 14th international joint conference on Artificial intelligence - Volume 2.

Cited by (5):
On autonomous k-means clustering. ISMIS'05 Proceedings of the 15th international conference on Foundations of Intelligent Systems.
Query-driven iterated neighborhood graph search for large scale indexing. Proceedings of the 20th ACM international conference on Multimedia.
Posterior Expectation of Regularly Paved Random Histograms. ACM Transactions on Modeling and Computer Simulation (TOMACS) - Special Issue on Monte Carlo Methods in Statistics.
CID: an efficient complexity-invariant distance for time series. Data Mining and Knowledge Discovery.
Using Non-Zero Dimensions for the Cosine and Tanimoto Similarity Search Among Real Valued Vectors. Fundamenta Informaticae - To Andrzej Skowron on His 70th Birthday.


Abstract

This paper is about metric data structures in high-dimensional or non-Euclidean space that permit cached-sufficient-statistics accelerations of learning algorithms. It has recently been shown that, for fewer than about 10 dimensions, decorating kd-trees with additional "cached sufficient statistics" such as first and second moments and contingency tables can provide satisfying acceleration for a very wide range of statistical learning tasks such as kernel regression, locally weighted regression, k-means clustering, mixture modeling and Bayes Net learning. In this paper, we begin by defining the anchors hierarchy: a fast data structure and algorithm for localizing data based only on a distance metric that obeys the triangle inequality. We show how this, in its own right, gives a fast and effective clustering of data. More importantly, we show how it can produce a well-balanced structure similar to a Ball-Tree (Omohundro, 1991) or a kind of metric tree (Uhlmann, 1991; Ciaccia, Patella, & Zezula, 1997) in a way that is neither "top-down" nor "bottom-up" but instead "middle-out". We then show how this structure, decorated with cached sufficient statistics, allows a wide variety of statistical learning algorithms to be accelerated even in thousands of dimensions.
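
The abstract does not spell out how the anchors hierarchy is built, so the following is only a minimal sketch of the two general ideas it names, not the paper's algorithm. It assumes a Euclidean metric and a given list of anchor points. The function names `assign_to_anchors` and `cached_stats` are hypothetical. The first function assigns each point to its nearest anchor and uses the triangle inequality to skip distance computations that cannot change the answer. The second caches simple per-anchor sufficient statistics (counts and coordinate sums), from which first moments can later be read off without revisiting the data.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance; any metric obeying the triangle inequality would do."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign_to_anchors(points, anchors, dist=euclidean):
    """Assign each point to its nearest anchor.

    Returns (owners, skipped): owners[i] is the index of the anchor that owns
    points[i]; skipped counts point-to-anchor distances never computed.
    """
    k = len(anchors)
    # Anchor-to-anchor distances are computed once and reused for every point.
    between = [[dist(anchors[i], anchors[j]) for j in range(k)] for i in range(k)]
    owners, skipped = [], 0
    for x in points:
        best, best_d = 0, dist(x, anchors[0])
        for j in range(1, k):
            # Triangle inequality: d(x, a_j) >= d(a_best, a_j) - d(x, a_best).
            # If d(a_best, a_j) >= 2 * d(x, a_best), then d(x, a_j) >= d(x, a_best),
            # so a_j cannot be strictly closer and its distance is not needed.
            if between[best][j] >= 2.0 * best_d:
                skipped += 1
                continue
            d = dist(x, anchors[j])
            if d < best_d:
                best, best_d = j, d
        owners.append(best)
    return owners, skipped


def cached_stats(points, owners, k):
    """Cache per-anchor counts and coordinate sums (first-moment statistics)."""
    dim = len(points[0])
    counts = [0] * k
    sums = [[0.0] * dim for _ in range(k)]
    for x, o in zip(points, owners):
        counts[o] += 1
        for t, v in enumerate(x):
            sums[o][t] += v
    return counts, sums


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative data only: 500 points in 100 dimensions, first 10 used as anchors.
    data = [[rng.gauss(0.0, 1.0) for _ in range(100)] for _ in range(500)]
    anchor_points = data[:10]
    owners, skipped = assign_to_anchors(data, anchor_points)
    counts, sums = cached_stats(data, owners, len(anchor_points))
    print("distance computations skipped:", skipped)
    print("points per anchor:", counts)
```

The pruning test touches only distances between points, never coordinates, so the same rule applies in any metric space. How much work it saves depends on how well separated the anchors are relative to the spread of the data.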