Pivot selection: Dimension reduction for distance-based indexing

  • Authors:
  • Rui Mao; Willard L. Miranker; Daniel P. Miranker

  • Affiliations:
  • Shenzhen University, 3688 Nanhai Rd., Office Tower #342, Shenzhen, Guangdong, 518060, China; Yale University, 227 Church Street, PH2E, New Haven, CT 06510, USA; University of Texas at Austin, 1 University Station, C0500, Austin, TX 78712, USA

  • Venue:
  • Journal of Discrete Algorithms
  • Year:
  • 2012

Abstract

Distance-based indexing exploits only the triangle inequality to answer similarity queries in metric spaces. Because metric spaces lack coordinate structure, mathematical tools from R^n can be applied only indirectly, which makes metric-space indexing difficult to study theoretically. Toward solving this problem, a common algorithmic step is to select a small number of special points, called pivots, and map the data objects to a low-dimensional space with one dimension per pivot, where each coordinate is the distance from the data object to that pivot. We formalize a "pivot space model" in which all the data objects are used as pivots, so that the data is mapped from the metric space to R^n while preserving all pairwise distances under L^∞. With this model, it can be shown that the indexing problem in metric space can be studied equivalently in R^n. Further, we show the necessity of dimension reduction in R^n and that the only effective form of dimension reduction is to select existing dimensions, i.e., pivot selection. The coordinate structure of R^n makes the application of many mathematical tools possible. In particular, Principal Component Analysis (PCA) is incorporated into a heuristic method for pivot selection and shown to be effective over a large range of workloads. We also show that PCA can be used to reliably measure the intrinsic dimension of a metric space.
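
The pivot mapping described in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the edit-distance metric, the sample words, and the helper names `edit_distance`, `pivot_map`, and `l_inf` are all assumptions chosen for illustration. The sketch maps each object to its vector of distances to the pivots, checks that using every object as a pivot preserves pairwise distances under L^∞, and checks that a reduced pivot set yields only a lower bound on the original distance (a consequence of the triangle inequality).

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance: a metric on strings, which have no coordinates."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def pivot_map(objects, pivots, dist):
    """Map each object to a vector with one coordinate per pivot."""
    return [[dist(x, p) for p in pivots] for x in objects]


def l_inf(u, v):
    """L-infinity (Chebyshev) distance between two coordinate vectors."""
    return max(abs(a - b) for a, b in zip(u, v))


# Hypothetical sample data, chosen only for illustration.
words = ["pivot", "point", "paint", "metric", "matrix", "index"]
pairs = list(combinations(range(len(words)), 2))

# Complete pivot space: every data object serves as a pivot.
complete = pivot_map(words, words, edit_distance)
preserved = all(
    l_inf(complete[i], complete[j]) == edit_distance(words[i], words[j])
    for i, j in pairs
)

# Reduced pivot space: keep two pivots (an arbitrary choice, not a heuristic).
reduced = pivot_map(words, words[:2], edit_distance)
lower_bound_holds = all(
    l_inf(reduced[i], reduced[j]) <= edit_distance(words[i], words[j])
    for i, j in pairs
)

print("distances preserved in complete pivot space:", preserved)
print("reduced pivot space gives a lower bound:", lower_bound_holds)
```

The two pivots in the reduced mapping are picked arbitrarily here. The paper's PCA-based heuristic for choosing them is not reproduced, since the abstract does not give its details.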