On locality sensitive hashing in metric spaces

  • Authors:
  • Eric Sadit Tellez; Edgar Chavez

  • Affiliations:
  • Universidad Michoacana, México; Universidad Michoacana / CICESE, México

  • Venue:
  • Proceedings of the Third International Conference on SImilarity Search and APplications
  • Year:
  • 2010

Abstract

Modeling proximity search problems as a metric space provides a general framework applicable in many areas, such as pattern recognition, web search, clustering, data mining, knowledge management, and textual and multimedia information retrieval. Metric indexes have been improved over the years, and many instances of the problem can be solved efficiently. However, when very large or high-dimensional metric databases are indexed, exact approaches cannot yet solve the problem efficiently; in these circumstances their performance degrades to almost sequential search. To overcome this limitation, non-exact proximity searching algorithms can be used to give answers that are close to the exact result, either with high probability or within an approximation factor. Approximation is acceptable in many contexts, especially when human judgement about closeness is involved. In vector spaces, on the other hand, there is a very successful approach dubbed Locality Sensitive Hashing, which consists in building a succinct representation of the objects that is relatively insensitive to small variations of the locality. Unfortunately, the hashing functions have to be carefully designed, tightly coupled to the data model, and different functions are needed when objects come from different domains. In this paper we give a new scheme to encode objects in a general metric space within a uniform framework, independent of the data model. Finally, we provide experimental support for our claims using several real-life databases with different data models and distance functions, obtaining excellent results in both speed and recall, especially for large databases.
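To illustrate the general idea of locality sensitive hashing in a metric space (this is only a minimal sketch of the family of techniques, not the authors' exact construction; every function name and parameter below is hypothetical), one can hash each object to the identifier of its nearest reference point among a small random sample of the database, repeated over several independent tables, and then rank the collected candidates with the exact distance:

```python
import random

def build_lsh_index(db, dist, num_tables=4, num_refs=8, seed=0):
    """Hash every object to the index of its nearest reference point.

    Each table draws its own random references, so an object falls into
    one bucket per table; nearby objects tend to share buckets.
    """
    rng = random.Random(seed)
    tables = []
    for _ in range(num_tables):
        refs = rng.sample(db, num_refs)
        buckets = {}
        for obj in db:
            key = min(range(num_refs), key=lambda i: dist(obj, refs[i]))
            buckets.setdefault(key, []).append(obj)
        tables.append((refs, buckets))
    return tables

def query(tables, q, dist, k=3):
    """Gather candidates from the matching bucket of every table,
    then rank them by the exact distance to the query."""
    cand = set()
    for refs, buckets in tables:
        key = min(range(len(refs)), key=lambda i: dist(q, refs[i]))
        cand.update(buckets.get(key, []))
    return sorted(cand, key=lambda o: dist(q, o))[:k]

# Example on a toy metric space: integers under absolute difference.
db = list(range(0, 300, 3))
d = lambda a, b: abs(a - b)
tables = build_lsh_index(db, d)
neighbors = query(tables, 31, d, k=2)
```

Only objects in the visited buckets are compared against the query, which is where the speedup over sequential scan comes from; the price is that a true neighbor hashed to a different bucket in every table is missed, which is why recall is the quality measure reported.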