On the effects of dimensionality reduction on high dimensional similarity search

Authors:
Charu C. Aggarwal
Affiliations:
IBM T. J. Watson Research Center, Yorktown Heights, NY
Venue:
PODS '01 Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Year:
2001

Citing 16
Cited 22

The design and analysis of spatial data structures

Nearest neighbor queries

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Latent semantic indexing: a probabilistic analysis

PODS '98 Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Dimensionality reduction for similarity searching in dynamic databases

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Applications of linear algebra in information retrieval and hypertext analysis

PODS '99 Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
A similarity-based probability model for latent semantic indexing

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Finding generalized projected clusters in high dimensional spaces

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
The IGrid index: reversing the dimensionality curse for similarity indexing in high dimensional space

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
R-trees: a dynamic index structure for spatial searching

SIGMOD '84 Proceedings of the 1984 ACM SIGMOD international conference on Management of data
The TV-tree: an index structure for high-dimensional data

The VLDB Journal — The International Journal on Very Large Data Bases - Spatial Database Systems
A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces

VLDB '98 Proceedings of the 24th International Conference on Very Large Data Bases
The R+-Tree: A Dynamic Index for Multi-Dimensional Objects

VLDB '87 Proceedings of the 13th International Conference on Very Large Data Bases
What Is the Nearest Neighbor in High Dimensional Spaces?

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
The X-tree: An Index Structure for High-Dimensional Data

VLDB '96 Proceedings of the 22nd International Conference on Very Large Data Bases

On the Use of Conceptual Reconstruction for Mining Massively Incomplete Data Sets

IEEE Transactions on Knowledge and Data Engineering
GPCA: an efficient dimension reduction scheme for image compression and retrieval

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Generalized low rank approximations of matrices

ICML '04 Proceedings of the twenty-first international conference on Machine learning
Finding Needles in Large-Scale Multivariate Data Haystacks

IEEE Computer Graphics and Applications
IDR/QR: An Incremental Dimension Reduction Algorithm via QR Decomposition

IEEE Transactions on Knowledge and Data Engineering
Browsing a document collection represented in two-and three-dimensional virtual information space

International Journal of Human-Computer Studies
Generalized Low Rank Approximations of Matrices

Machine Learning
On the impact of outliers on high-dimensional data analysis methods for face recognition

Proceedings of the 2nd international workshop on Computer vision meets databases
Towards automatic feature vector optimization for multimedia applications

Proceedings of the 2008 ACM symposium on Applied computing
Assessing the best integration between distance-function and image-feature to answer similarity queries

Proceedings of the 2008 ACM symposium on Applied computing
Robust detection of outliers for projection-based face recognition methods

Multimedia Tools and Applications
Efficient Processing of Nearest Neighbor Queries in Parallel Multimedia Databases

DEXA '08 Proceedings of the 19th international conference on Database and Expert Systems Applications
Dimensionality reduction for similarity search with the Euclidean distance in high-dimensional applications

Multimedia Tools and Applications
SMVLLE: An Efficient Dimension Reduction Scheme

ISNN 2009 Proceedings of the 6th International Symposium on Neural Networks: Advances in Neural Networks - Part II
A statistical approach for selecting discriminative features of spatial regions of interest

Intelligent Data Analysis
Transforming range queries to equivalent box queries to optimize page access

Proceedings of the VLDB Endowment
ATLAS: a probabilistic algorithm for high dimensional similarity search

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
A fast hybrid classification algorithm based on the minimum distance and the k-NN classifiers

Proceedings of the Fourth International Conference on SImilarity Search and APplications
High-dimensional similarity search using data-sensitive space partitioning

DEXA'06 Proceedings of the 17th international conference on Database and Expert Systems Applications
Improve top-k recommendation by extending review analysis

APWeb'12 Proceedings of the 14th Asia-Pacific international conference on Web Technologies and Applications
A data allocation method for efficient content-based retrieval in parallel multimedia databases

ISPA'07 Proceedings of the 2007 international conference on Frontiers of High Performance Computing and Networking
Dimensionality reduction in high-dimensional space for multimedia information retrieval

DEXA'07 Proceedings of the 18th international conference on Database and Expert Systems Applications

Abstract

The dimensionality curse has profound effects on the performance of high-dimensional similarity indexing. One well-known technique for improving indexing performance is dimensionality reduction: the data is transformed to a lower-dimensional space by finding a new axis system in which most of the data variance is preserved in a few dimensions. For certain data domains, such as text, this reduction may also improve the quality of similarity search; for other domains, it may lead to loss of information and degraded search quality. Recent research indicates that the improvement in the text domain is caused by the reinforcement of the semantic concepts in the data. In this paper, we provide an intuitive model of the effects of dimensionality reduction on arbitrary high-dimensional problems, together with an effective diagnosis of the causes behind the qualitative effects of dimensionality reduction on a given data set. The analysis suggests that these effects are highly data dependent. It also indicates that the currently accepted practice of picking the reduction that loses the least information is useful for maximizing precision and recall, but is not necessarily optimal from a qualitative perspective. We demonstrate that simple changes to the implementation details of dimensionality reduction techniques can considerably improve the quality of similarity search.
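
As a rough illustration of the transformation the abstract describes (finding a new axis system in which a few dimensions carry most of the variance), the sketch below runs a plain principal component analysis on toy data. It is not the paper's method and does not include the implementation changes the abstract alludes to. The function names (`mean_center`, `covariance`, `top_eigenpair`, `pca`), the toy data, and the use of power iteration with deflation in place of a library eigensolver are all assumptions made to keep the example dependency-free.

```python
import math
import random


def mean_center(rows):
    """Subtract the per-dimension mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix (d x d) of already mean-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def mat_vec(matrix, v):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]


def top_eigenpair(matrix, iters=500, seed=0):
    """Approximate the dominant (eigenvalue, unit eigenvector) of a
    symmetric positive semidefinite matrix by power iteration."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in matrix]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = mat_vec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left in any direction
            break
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the eigenvalue estimate.
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(matrix, v)))
    return eigenvalue, v


def pca(rows, k):
    """Project rows onto the k directions of largest variance.

    Returns (reduced_rows, axes, variances)."""
    centered = mean_center(rows)
    cov = covariance(centered)
    d = len(cov)
    axes, variances = [], []
    for _ in range(k):
        lam, v = top_eigenpair(cov)
        axes.append(v)
        variances.append(lam)
        # Deflation: remove the component just found so the next power
        # iteration converges to the next-largest direction.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    reduced = [[sum(x * a for x, a in zip(r, axis)) for axis in axes]
               for r in centered]
    return reduced, axes, variances


# Hypothetical toy data: 3-D points lying close to the line (t, 2t, -t).
rng = random.Random(42)
rows = [[t + rng.gauss(0, 0.05), 2 * t + rng.gauss(0, 0.05), -t + rng.gauss(0, 0.05)]
        for t in (i / 5 - 5 for i in range(51))]

reduced, axes, variances = pca(rows, k=1)
total = sum(covariance(mean_center(rows))[j][j] for j in range(3))
print("first axis:", [round(a, 3) for a in axes[0]])
print("fraction of variance kept:", round(variances[0] / total, 4))
```

Choosing `k` by the fraction of variance kept corresponds to the "least loss of information" criterion mentioned in the abstract, which the paper argues is not necessarily the best choice for similarity quality.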