Lower bounds on performance of metric tree indexing schemes for exact similarity search in high dimensions

  • Authors:
  • Vladimir Pestov

  • Affiliations:
  • Universidade Federal de Santa Catarina, Florianópolis-SC, Brasil and University of Ottawa, Ontario, Canada

  • Venue:
  • Proceedings of the Fourth International Conference on SImilarity Search and APplications
  • Year:
  • 2011

Abstract

Within a mathematically rigorous model borrowed from statistical learning theory, we analyse the curse of dimensionality for similarity-based information retrieval in the context of a popular family of indexing schemes: metric trees. The datasets X are sampled randomly from a domain Ω equipped with a distance ρ and an underlying probability distribution μ. In our asymptotic analysis, the intrinsic dimension d of Ω is sent to infinity, while the size n of a dataset grows superpolynomially yet subexponentially in d. Exact similarity search refers to finding the nearest neighbour in the dataset X to a query point ω ∈ Ω, where the query points follow the same probability distribution μ as the datapoints. Let F denote the class of all 1-Lipschitz functions on Ω that can be used as decision functions in constructing a hierarchical metric tree indexing scheme. Suppose the VC dimension of the family of sets {ω: f(ω) ≥ a}, a ∈ R, is d^O(1). (In view of a 1995 result of Goldberg and Jerrum, this is a reasonable complexity assumption.) We deduce lower bounds, superpolynomial in d, on the expected average-case performance of hierarchical metric-tree-based indexing schemes for exact similarity search in (Ω, X).
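To make the setting concrete, here is a minimal sketch of a metric-tree index of the kind the abstract analyses: a vantage-point tree, whose decision functions f(ω) = ρ(ω, p) (distance to a pivot p) are 1-Lipschitz, so the triangle inequality licenses pruning during exact nearest-neighbour search. The Euclidean metric, the random data, and all function names below are illustrative assumptions, not part of the paper's model.

```python
import random

def euclidean(x, y):
    # Illustrative choice of the metric rho on the domain Omega.
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

class VPNode:
    """One node of a vantage-point tree: split by the 1-Lipschitz
    decision function f(omega) = rho(omega, pivot) at a median radius."""

    def __init__(self, points, rho):
        self.pivot = points[0]
        rest = points[1:]
        self.radius, self.inner, self.outer = 0.0, None, None
        if rest:
            dists = [rho(self.pivot, q) for q in rest]
            self.radius = sorted(dists)[len(dists) // 2]  # median split
            inner = [q for q, d in zip(rest, dists) if d <= self.radius]
            outer = [q for q, d in zip(rest, dists) if d > self.radius]
            self.inner = VPNode(inner, rho) if inner else None
            self.outer = VPNode(outer, rho) if outer else None

    def nearest(self, query, rho, best=None):
        # Exact search: descend the promising branch, and prune the
        # other one only when the triangle inequality proves it cannot
        # contain a closer point (this is where 1-Lipschitzness enters).
        d = rho(query, self.pivot)
        if best is None or d < best[0]:
            best = (d, self.pivot)
        if d <= self.radius:
            near, far = self.inner, self.outer
        else:
            near, far = self.outer, self.inner
        if near is not None:
            best = near.nearest(query, rho, best)
        if far is not None and abs(d - self.radius) <= best[0]:
            best = far.nearest(query, rho, best)
        return best

# Usage: a random dataset X sampled from Omega = [0,1]^3 (an assumption
# for the demo), indexed once and queried for an exact nearest neighbour.
random.seed(0)
data = [tuple(random.random() for _ in range(3)) for _ in range(200)]
tree = VPNode(data, euclidean)
query = (0.5, 0.5, 0.5)
dist, nn = tree.nearest(query, euclidean)
```

The paper's lower bounds concern exactly this kind of scheme: in high intrinsic dimension, the pruning test above fires rarely, so the search degenerates towards scanning most of the tree.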