Persistent clustered main memory index for accelerating k-NN queries on high dimensional datasets

Authors:
Alexander Thomasian;Lijuan Zhang
Affiliations:
Computer Science Department, New Jersey Institute of Technology, Newark, USA 07102 and Thomasian and Associates, Pleasantville, USA 10570;Computer Science Department, New Jersey Institute of Technology, Newark, USA 07102 and AMICAS Inc., Boston, USA
Venue:
Multimedia Tools and Applications
Year:
2008

Citing 27
Cited 1

A Fast k Nearest Neighbor Finding Algorithm Based on the Ordered Partition

IEEE Transactions on Pattern Analysis and Machine Intelligence
The R*-tree: an efficient and robust access method for points and rectangles

SIGMOD '90 Proceedings of the 1990 ACM SIGMOD international conference on Management of data
Nearest neighbor queries

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Efficiently supporting ad hoc queries in large datasets of time sequences

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
The SR-tree: an index structure for high-dimensional nearest neighbor queries

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
Multidimensional access methods

ACM Computing Surveys (CSUR)
Clustering and singular value decomposition for approximate indexing in high dimensional spaces

Proceedings of the seventh international conference on Information and knowledge management
Distance browsing in spatial databases

ACM Transactions on Database Systems (TODS)
Scalability for clustering algorithms revisited

ACM SIGKDD Explorations Newsletter
Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases

ACM Computing Surveys (CSUR)
Searching Multimedia Databases by Content

Searching Multimedia Databases by Content
Image Databases: Search and Retrieval of Digital Imagery

Image Databases: Search and Retrieval of Digital Imagery
The K-D-B-tree: a search structure for large multidimensional dynamic indexes

SIGMOD '81 Proceedings of the 1981 ACM SIGMOD international conference on Management of data
R-trees: a dynamic index structure for spatial searching

SIGMOD '84 Proceedings of the 1984 ACM SIGMOD international conference on Management of data
Clustering for Approximate Similarity Search in High-Dimensional Spaces

IEEE Transactions on Knowledge and Data Engineering
Similarity Indexing with the SS-tree

ICDE '96 Proceedings of the Twelfth International Conference on Data Engineering
Similarity Search without Tears: The OMNI Family of All-purpose Access Methods

Proceedings of the 17th International Conference on Data Engineering
A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces

VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
The X-tree: An Index Structure for High-Dimensional Data

VLDB '96 Proceedings of the 22th International Conference on Very Large Data Bases
Ranking in Spatial Databases

SSD '95 Proceedings of the 4th International Symposium on Advances in Spatial Databases
CSVD: Clustering and Singular Value Decomposition for Approximate Similarity Search in High-Dimensional Spaces

IEEE Transactions on Knowledge and Data Engineering
Rules of Thumb in Data Engineering

ICDE '00 Proceedings of the 16th International Conference on Data Engineering
Buffering databse operations for enhanced instruction cache performance

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Exact k-NN queries on clustered SVD datasets

Information Processing Letters
High-dimensional indexing methods utilizing clustering and dimensionality reduction

High-dimensional indexing methods utilizing clustering and dimensionality reduction
Multidimensional Binary Search Trees in Database Applications

IEEE Transactions on Software Engineering

Dimensionality reduction for similarity search with the Euclidean distance in high-dimensional applications

Multimedia Tools and Applications

Quantified Score

Hi-index	0.00

Visualization

Abstract

Similarity search implemented via k-nearest neighbor-- k-NN queries on multidimensional indices is an extremely useful paradigm for content-based image retrieval. As the dimensionality of feature vectors increases the curse of dimensionality sets in, i.e., the performance of k-NN search of disk-resident indices in the R-tree family degrades rapidly due to the overlap in index pages in high dimensions. This problem is dealt with in this study by utilizing the double filtering effect of clustering and indexing. The clustering algorithm ensures that the largest cluster fits into main memory and that only clusters closest to a query point need to be searched and hence loaded into main memory. We organize the data in each cluster according to the ordered-partition--OP-tree main memory resident index, which is not prone to the curse of dimensionality and highly efficient for processing k-NN queries. We serialize an OP-tree by writing its dynamically allocated nodes into contiguous memory locations, optimize its parameters, and make it persistent by writing it to disk. The time to read and write clusters constituting an OP-tree with a single sequential access to disk benefits from higher data transfer rates of modern disk drives. The performance of the index is further improved by applying the Karhunen---Loève transformation--KLT to the dataset, since this results in a more efficient computation of distances for k-NN queries. We compare OP-trees and sequential scans with and without a KL-transformation and with and without using a shortcut method in calculating Euclidean distances. A comparison against the OMNI-sequential scan is also reported. We finally compare a clustered and persistent version of the OP-tree against a clustered version of the SR-tree and the VA-file method. CPU time is measured and elapsed time is estimated in this study. It is observed that the OP-tree index outperforms the other two methods and that the improvement increases with the number of dimensions.