An Efficient Subspace Sampling Framework for High-Dimensional Data Reduction, Selectivity Estimation, and Nearest-Neighbor Search

Authors: Charu C. Aggarwal
Affiliations: IEEE
Venue: IEEE Transactions on Knowledge and Data Engineering
Year: 2004

Abstract

Data reduction can lower the storage, transfer-time, and processing requirements of very large data sets. A key challenge in designing effective data reduction techniques is preserving the ability to use the reduced format directly in a wide range of database and data mining applications. In this paper, we propose the novel idea of hierarchical subspace sampling to create a reduced representation of the data. The method naturally and effectively estimates the local implicit dimensionality of each point and thereby creates a variable-dimensionality reduced representation of the data. The technique is highly adaptive: it adjusts its representation according to the behavior of the immediate locality of each data point. An important property of the subspace sampling technique is that the overall efficiency of compression improves with increasing database size. Because of its sampling approach, the procedure is extremely fast and scales linearly with both data set size and dimensionality. We also propose new and effective solutions to problems such as selectivity estimation and approximate nearest-neighbor search. These solutions exploit the locality-specific subspace characteristics of the data that the subspace sampling technique reveals.
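
The abstract does not give the algorithm itself, so the following is only an illustrative sketch of the general idea of a sampled subspace and a per-point, variable-dimensionality representation. It is not the paper's hierarchical procedure. The function names (`sampled_basis`, `residual`, `implicit_dims`), the tolerance `tol`, and the use of NumPy's QR factorization to orthonormalize the sampled directions are all assumptions made for this example. For each candidate dimensionality k, the sketch draws k+1 data points, builds the subspace through them, and records k as the implicit dimensionality of every point that this subspace reconstructs to within `tol`.

```python
import numpy as np


def sampled_basis(data, k, rng):
    """Return (origin, q): a k-dimensional subspace through k+1 sampled points.

    Hypothetical helper: the subspace is anchored at one sampled point and
    spanned by the differences to the other k sampled points.
    """
    idx = rng.choice(len(data), size=k + 1, replace=False)
    origin = data[idx[0]]
    directions = (data[idx[1:]] - origin).T   # shape (dim, k)
    q, _ = np.linalg.qr(directions)           # orthonormal columns, shape (dim, k)
    return origin, q


def residual(data, origin, q):
    """Distance from each point to the subspace defined by (origin, q)."""
    centered = data - origin
    coords = centered @ q                     # reduced coordinates, shape (n, k)
    return np.linalg.norm(centered - coords @ q.T, axis=1)


def implicit_dims(data, max_k, tol, seed=0):
    """Assign each point the smallest sampled k that reconstructs it within tol.

    Points that no sampled subspace fits keep the full dimensionality.
    """
    rng = np.random.default_rng(seed)
    dims = np.full(len(data), data.shape[1])
    unassigned = np.ones(len(data), dtype=bool)
    for k in range(1, max_k + 1):
        origin, q = sampled_basis(data, k, rng)
        fits = unassigned & (residual(data, origin, q) <= tol)
        dims[fits] = k
        unassigned &= ~fits
    return dims
```

As a usage example, for points generated on a 2-dimensional plane embedded in 6 dimensions, `implicit_dims(points, max_k=4, tol=1e-8)` would be expected to report a dimensionality of at most 2 for every point, so each point could be stored with at most two coordinates plus a reference to its subspace. A single random draw per k is used here for brevity; any practical scheme would need many samples and a way to organize them, which is the part the paper's hierarchical framework addresses.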