Distance Exponent: A New Concept for Selectivity Estimation in Metric Trees

Authors:
Caetano Traina Jr.
Affiliations:
-
Venue:
ICDE '00 Proceedings of the 16th International Conference on Data Engineering
Year:
2000

Citing 0
Cited 8

Searching in metric spaces with user-defined and approximate distances

ACM Transactions on Database Systems (TODS)
Fast Indexing and Visualization of Metric Data Sets using Slim-Trees

IEEE Transactions on Knowledge and Data Engineering
String Matching with Metric Trees Using an Approximate Distance

SPIRE 2002 Proceedings of the 9th International Symposium on String Processing and Information Retrieval
Genetic algorithms for approximate similarity queries

Data & Knowledge Engineering
The Omni-family of all-purpose access methods: a simple and effective way to make similarity search more efficient

The VLDB Journal — The International Journal on Very Large Data Bases
Estimating the selectivity of tf-idf based cosine similarity predicates

ACM SIGMOD Record
Time-Aware Similarity Search: A Metric-Temporal Representation for Complex Data

SSTD '09 Proceedings of the 11th International Symposium on Advances in Spatial and Temporal Databases

Abstract

This paper discusses the problem of selectivity estimation for range queries in metric datasets, which include vector, or dimensional, datasets as a special case. The main contribution of this paper is the observation that, surprisingly, many different real datasets follow a "power law". From this observation we derive an analysis of the distance distribution of metric datasets. This is the first analysis of distance distributions for real metric datasets.

We call the exponent of our power law the "distance exponent". We show that it plays a relevant role in the analysis of real metric datasets. Specifically, we show (a) how to exploit the distance exponent to derive formulas for selectivity estimation of range queries and (b) how to compute it quickly from a metric index tree.

We performed several experiments on many real datasets (road intersections of U.S. counties, feature vectors extracted from face-matching systems, sets of words, distance matrices) and synthetic datasets (Sierpinski triangle, a 2-dimensional uniform distribution and a 2-dimensional line). Our selectivity estimation formulas are accurate, with relative errors between 4% and 17%, and always within one standard deviation of the analytical results. Moreover, we also present a quick algorithm to estimate the "distance exponent", which gives good accuracy and saves orders of magnitude in computation time.
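The abstract gives no formulas, so the following is only a rough sketch of the general idea, not the paper's method. It assumes the power law takes the form "number of object pairs within distance r is about K * r^D", fits D by brute-force least squares on a log-log plot (rather than from a metric index tree), and uses the fit to estimate how many objects a range query centered on a dataset object would return. The function names, the Euclidean distance, and the estimation formula are all assumptions made for illustration.

```python
import math
from itertools import combinations


def pair_counts(points, radii):
    """Count unordered pairs of points whose distance is <= each radius."""
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    return [sum(1 for d in dists if d <= r) for r in radii]


def fit_power_law(radii, counts):
    """Fit counts ~ K * r**D by least squares in log-log space.

    Returns (K, D). All counts must be positive.
    """
    xs = [math.log(r) for r in radii]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    K = math.exp(mean_y - slope * mean_x)
    return K, slope


def estimate_neighbors(K, D, n_points, r):
    """Assumed estimate: average number of other objects within r of a
    dataset object, given that K * r**D unordered pairs lie within r."""
    return 2.0 * K * r ** D / n_points


# Hypothetical test data: 200 evenly spaced points on a line,
# for which the fitted exponent should be close to 1.
points = [(float(i), 0.0) for i in range(200)]
radii = [5.5, 10.5, 20.5]
counts = pair_counts(points, radii)
K, D = fit_power_law(radii, counts)
estimate = estimate_neighbors(K, D, len(points), 10.5)
print(f"fitted exponent D = {D:.3f}, estimated neighbors at r=10.5: {estimate:.1f}")
```

The brute-force pair count is quadratic in the number of objects, which is the cost the paper's quick tree-based algorithm is meant to avoid; it is used here only because it needs no index structure.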