On the 'Dimensionality Curse' and the 'Self-Similarity Blessing'

Authors:
Flip Korn;Bernd-Uwe Pagel;Christos Faloutsos
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2001

Abstract

Spatial queries in high-dimensional spaces have been studied extensively recently. Among them, nearest-neighbor queries are important in many settings, including spatial databases ("Find the $k$ closest cities") and multimedia databases ("Find the $k$ most similar images"). Previous analyses have concluded that nearest-neighbor search is hopeless in high dimensions due to the notorious "curse of dimensionality." Here, we show that this may be overpessimistic. We show that what determines the search performance (at least for R-tree-like structures) is the intrinsic dimensionality of the data set and not the dimensionality of the address space (referred to as the embedding dimensionality). The typical (and often implicit) assumption in many previous studies is that the data is uniformly distributed, with independence between attributes. However, real data sets overwhelmingly disobey these assumptions; rather, they typically are skewed and exhibit intrinsic ("fractal") dimensionalities that are much lower than their embedding dimension, e.g., due to subtle dependencies between attributes. In this paper, we show how the Hausdorff and Correlation fractal dimensions of a data set can yield extremely accurate formulas that can predict the I/O performance to within one standard deviation on multiple real and synthetic data sets. The practical contributions of this work are our accurate formulas, which can be used for query optimization in spatial and multimedia databases. The major theoretical contribution is the "deflation" of the dimensionality curse: Our formulas and our experiments show that previous worst-case analyses of nearest-neighbor search in high dimensions are overpessimistic to the point of being unrealistic. The performance depends critically on the intrinsic ("fractal") dimensionality as opposed to the embedding dimension that the uniformity and independence assumptions incorrectly imply.
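To make the distinction between intrinsic and embedding dimensionality concrete, the following is a minimal sketch of one common way to estimate the correlation fractal dimension: count the pairs of points within distance r for several radii and fit the slope of log(count) against log(r). It is an illustration only, not the paper's estimator or its I/O cost formulas; the function name `correlation_dimension`, the radii, the sample sizes, and the synthetic data sets are all assumptions chosen for the example.

```python
import numpy as np


def correlation_dimension(points, radii):
    """Estimate the correlation fractal dimension as the slope of
    log(number of point pairs within distance r) versus log(r).

    Hypothetical helper for illustration; O(n^2) memory, so only
    suitable for small samples.
    """
    pts = np.asarray(points, dtype=float)
    radii = np.asarray(radii, dtype=float)
    n = len(pts)

    # Pairwise squared Euclidean distances via the Gram-matrix identity.
    sq = np.sum(pts ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (pts @ pts.T)
    np.maximum(d2, 0.0, out=d2)  # clip tiny negative rounding errors

    # Keep each unordered pair once (strict upper triangle).
    dist = np.sqrt(d2[np.triu_indices(n, k=1)])

    # Pair counts for every radius; skip radii with no pairs.
    counts = np.array([(dist <= r).sum() for r in radii])
    mask = counts > 0
    slope, _ = np.polyfit(np.log(radii[mask]), np.log(counts[mask]), 1)
    return slope


rng = np.random.default_rng(0)
radii = np.geomspace(0.03, 0.2, 8)

# A line embedded in 10-D: embedding dimension 10, intrinsic dimension 1.
t = rng.random(1500)
line_10d = np.outer(t, np.ones(10))

# A plane embedded in 10-D: only two coordinates vary.
plane_10d = np.zeros((1500, 10))
plane_10d[:, :2] = rng.random((1500, 2))

d_line = correlation_dimension(line_10d, radii)
d_plane = correlation_dimension(plane_10d, radii)
print(f"line in 10-D:  estimated dimension {d_line:.2f}")
print(f"plane in 10-D: estimated dimension {d_plane:.2f}")
```

Both data sets live in a 10-dimensional address space, yet the estimates should come out near 1 and 2 respectively (slightly lower because of boundary effects at the larger radii). This is the sense in which attribute dependencies can leave the intrinsic dimensionality far below the embedding dimensionality.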