Spatial queries in high-dimensional spaces have recently been studied extensively. Among them, nearest-neighbor queries are important in many settings, including spatial databases ("Find the $k$ closest cities") and multimedia databases ("Find the $k$ most similar images"). Previous analyses have concluded that nearest-neighbor search is hopeless in high dimensions due to the notorious "curse of dimensionality." Here, we show that this may be overly pessimistic. We show that what determines the search performance (at least for R-tree-like structures) is the intrinsic dimensionality of the data set and not the dimensionality of the address space (referred to as the embedding dimensionality). The typical (and often implicit) assumption in many previous studies is that the data is uniformly distributed, with independence between attributes. However, real data sets overwhelmingly violate these assumptions; rather, they are typically skewed and exhibit intrinsic ("fractal") dimensionalities that are much lower than their embedding dimension, e.g., due to subtle dependencies between attributes. In this paper, we show how the Hausdorff and Correlation fractal dimensions of a data set can yield extremely accurate formulas that can predict the I/O performance to within one standard deviation on multiple real and synthetic data sets. The practical contributions of this work are our accurate formulas, which can be used for query optimization in spatial and multimedia databases. The major theoretical contribution is the "deflation" of the dimensionality curse: our formulas and our experiments show that previous worst-case analyses of nearest-neighbor search in high dimensions are overly pessimistic to the point of being unrealistic. The performance depends critically on the intrinsic ("fractal") dimensionality, as opposed to the embedding dimension that the uniformity and independence assumptions incorrectly imply.
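To make the distinction between embedding and intrinsic dimensionality concrete, the following is a minimal sketch, not the estimator or the formulas used in the paper. It assumes a standard pair-count approach: the correlation fractal dimension is approximated by the slope of log C(r) against log r, where C(r) is the fraction of point pairs within distance r. The helper names (`embed`, `correlation_dimension`), the sample sizes, and the radii are all hypothetical choices made for illustration. Both synthetic data sets live in a 10-dimensional address space, but their attributes are fully dependent, so the estimates should come out close to 1 and 2 rather than 10.

```python
import math
import random


def embed(params, dim):
    """Place a low-dimensional point in a dim-dimensional address space by
    repeating its coordinates, so the attributes are strongly dependent."""
    return [params[k % len(params)] for k in range(dim)]


def correlation_dimension(points, radii):
    """Estimate the correlation fractal dimension as the least-squares slope
    of log C(r) versus log r, where C(r) is the fraction of pairs within r."""
    dists = [
        math.dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    ]
    xs, ys = [], []
    for r in radii:
        frac = sum(1 for d in dists if d <= r) / len(dists)
        if frac > 0:  # skip radii that capture no pairs (log undefined)
            xs.append(math.log(r))
            ys.append(math.log(frac))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


if __name__ == "__main__":
    rng = random.Random(0)
    radii = [0.05 * 2 ** (k / 2) for k in range(7)]  # 0.05 up to 0.4

    # A line and a plane, each embedded in a 10-dimensional space.
    line = [embed([rng.random()], 10) for _ in range(400)]
    plane = [embed([rng.random(), rng.random()], 10) for _ in range(400)]

    print("embedding dimension: 10")
    print("estimated intrinsic dimension (line): ", correlation_dimension(line, radii))
    print("estimated intrinsic dimension (plane):", correlation_dimension(plane, radii))
```

The quadratic pair enumeration is only practical for small samples; it is used here because it keeps the sketch short and dependency-free. The estimate is also sensitive to the chosen range of radii, since boundary effects flatten the slope at larger scales.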