Hierarchical subspace sampling: a unified framework for high dimensional data reduction, selectivity estimation and nearest neighbor search

Authors:
Charu C. Aggarwal
Affiliations:
IBM T. J. Watson Research Center, Yorktown Heights, NY
Venue:
Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Year:
2002

Citing 15
Cited 8

The R*-tree: an efficient and robust access method for points and rectangles

SIGMOD '90 Proceedings of the 1990 ACM SIGMOD international conference on Management of data
Nearest neighbor queries

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Wavelet-based histograms for selectivity estimation

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Finding generalized projected clusters in high dimensional spaces

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Approximating multi-dimensional aggregate range queries over real attributes

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Database-friendly random projections

PODS '01 Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Independence is good: dependency-based histogram synopses for high-dimensional data

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
SPARTAN: a model-based semantic compression system for massive data tables

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases

ACM Computing Surveys (CSUR)
Optimal Histograms with Quality Guarantees

VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Semantic Compression and Pattern Extraction with Fascicles

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Fast Algorithms for Mining Association Rules in Large Databases

VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
Selectivity Estimation Without the Attribute Value Independence Assumption

VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases

Efficient similarity search and classification via rank aggregation

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
GPCA: an efficient dimension reduction scheme for image compression and retrieval

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
An Efficient Subspace Sampling Framework for High-Dimensional Data Reduction, Selectivity Estimation, and Nearest-Neighbor Search

IEEE Transactions on Knowledge and Data Engineering
On accessing data in high-dimensional spaces: a comparative study of three space partitioning strategies

Journal of Systems and Software - Special issue: Performance modeling and analysis of computer systems and networks
Adaptive non-linear clustering in data streams

CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
CACS: A Novel Classification Algorithm Based on Concept Similarity

ADMA '07 Proceedings of the 3rd international conference on Advanced Data Mining and Applications
Similarity computing model of high dimension data for symptom classification of Chinese traditional medicine

Applied Soft Computing
High-dimensional similarity search using data-sensitive space partitioning

DEXA'06 Proceedings of the 17th international conference on Database and Expert Systems Applications

Quantified Score

Hi-index	0.00

Visualization

Abstract

With the increased abilities for automated data collection made possible by modern technology, the typical sizes of data collections have continued to grow in recent years. In such cases, it may be desirable to store the data in a reduced format in order to improve the storage, transfer time, and processing requirements on the data. One of the challenges of designing effective data compression techniques is to be able to preserve the ability to use the reduced format directly for a wide range of database and data mining applications. In this paper, we propose the novel idea of hierarchical subspace sampling in order to create a reduced representation of the data. The method is naturally able to estimate the local implicit dimensionalities of each point very effectively, and thereby create a variable dimensionality reduced representation of the data. Such a technique has the advantage that it is very adaptive about adjusting its representation depending upon the behavior of the immediate locality of a data point. An interesting property of the subspace sampling technique is that unlike all other data reduction techniques, the overall efficiency of compression improves with increasing database size. This is a highly desirable property for any data reduction system since the problem itself is motivated by the large size of data sets. Because of its sampling approach, the procedure is extremely fast and scales linearly both with data set size and dimensionality. Furthermore, the subspace sampling technique is able to reveal important local subspace characteristics of high dimensional data which can be harnessed for effective solutions to problems such as selectivity estimation and approximate nearest neighbor search.