Pivot selection: Dimension reduction for distance-based indexing

Authors:
Rui Mao;Willard L. Miranker;Daniel P. Miranker
Affiliations:
Shenzhen University, 3688 Nanhai Rd., Office Tower #342, Shenzhen, Guangdong, 518060, China;Yale University, 227 Church Street, PH2E, New Haven, CT 06510, USA;University of Texas at Austin, 1 University station, C0500, Austin, TX 78712, USA
Venue:
Journal of Discrete Algorithms
Year:
2012

Abstract

Distance-based indexing exploits only the triangle inequality to answer similarity queries in metric spaces. Because a general metric space lacks coordinate structure, the mathematical tools of R^n can be applied only indirectly, which makes metric-space indexing difficult to study theoretically. A common algorithmic step toward solving this problem is to select a small number of special points, called pivots, and to map the data objects to a low-dimensional space with one dimension per pivot, where each dimension holds the distances from that pivot to the data objects. We formalize a "pivot space model" in which all n data objects are used as pivots, so that the data is mapped from the metric space to R^n with all pairwise distances preserved under the L^∞ norm. With this model, it can be shown that the indexing problem in a metric space can be studied equivalently in R^n. Further, we show that dimension reduction is necessary for R^n and that the only effective form of dimension reduction is to select existing dimensions, i.e., pivot selection. The coordinate structure of R^n makes it possible to apply many mathematical tools. In particular, Principal Component Analysis (PCA) is incorporated into a heuristic method for pivot selection and is shown to be effective over a large range of workloads. We also show that PCA can be used to reliably measure the intrinsic dimension of a metric space.
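
The pivot mapping described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`pivot_space`, `linf`, `edit_distance`), the toy word list, and the choice of edit distance as the metric are all assumptions made for illustration. The sketch maps each object to its vector of distances to the pivots. It then checks that using every object as a pivot preserves pairwise distances under L^∞, and that a pivot subset yields only a lower bound, which follows from the triangle inequality.

```python
import numpy as np


def edit_distance(a, b):
    """Levenshtein distance, a standard metric on strings (illustrative choice)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def pivot_space(objects, dist, pivots=None):
    """Map each object to its vector of distances to the pivots.

    With pivots=None every object serves as a pivot, so n objects
    are mapped into R^n (the complete pivot space).
    """
    if pivots is None:
        pivots = objects
    return np.array([[dist(x, p) for p in pivots] for x in objects], dtype=float)


def linf(u, v):
    """L-infinity (Chebyshev) distance between two coordinate vectors."""
    return float(np.max(np.abs(u - v)))


# Hypothetical toy data set in a non-coordinate metric space.
words = ["pivot", "point", "print", "paint", "metric", "matrix"]

# Complete pivot space: the L-infinity distance equals the original distance.
full = pivot_space(words, edit_distance)
for i, x in enumerate(words):
    for j, y in enumerate(words):
        assert linf(full[i], full[j]) == edit_distance(x, y)

# Pivot selection keeps a subset of the existing dimensions; the
# L-infinity distance is then only a lower bound on the original distance.
reduced = pivot_space(words, edit_distance, pivots=words[:2])
for i, x in enumerate(words):
    for j, y in enumerate(words):
        assert linf(reduced[i], reduced[j]) <= edit_distance(x, y)
```

The lower bound in the reduced space is what lets an index discard objects without computing their true distance to the query; how well it works depends on which pivots are kept, and the first two words are used here only as an arbitrary placeholder.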