The Stepwise Dimensionality Increasing (SDI) Index for High-Dimensional Data

  • Authors:
  • Alexander Thomasian; Lijuan Zhang

  • Affiliations:
  • *Corresponding author: alexthomasian@gmail.com; Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, USA

  • Venue:
  • The Computer Journal
  • Year:
  • 2006

Abstract

Similarity search is a powerful paradigm for image and multimedia databases, time series databases, and DNA and protein sequence databases. Objects are represented by high-dimensional feature vectors, based on color, texture, and shape in the case of images, for example. Object similarity is usually implemented via k-nearest neighbor (k-NN) queries, determined by the distance between the endpoints of the feature vectors. The cost of processing k-NN queries via a sequential scan increases with the number of objects and the number of dimensions. Multi-dimensional indexing structures can be used to improve the efficiency of k-NN query processing, but they lose their effectiveness as the dimensionality increases. The curse of dimensionality manifests itself in the form of increased overlap among the nodes of the index, so that a high fraction of index pages are touched in processing k-NN queries. The increased dimensionality also results in a reduced fanout and an increased index height. We propose a stepwise dimensionality increasing (SDI)-tree index, which aims at reducing the number of disk accesses and the CPU processing cost. The index is built using feature vectors transformed via principal component analysis. Dimensions are retained in non-increasing order of their variance according to a parameter p, which specifies the incremental fraction of variance at each level of the index. The optimal value for p is determined experimentally. Experiments on three datasets have shown that SDI-trees access fewer disk pages and incur less CPU time than SR-trees, VAMSR-trees, vector approximation (VA)-Files, and the recently proposed iDistance method. In terms of CPU time, SDI outperforms the sequential scan and OMNI methods.
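
The abstract does not spell out how p maps to dimensions per level, so the following is only a minimal sketch under one plausible reading: level k of the index retains the smallest number of leading principal components whose cumulative variance fraction reaches k·p (capped at 1). The function names `pca_transform` and `dims_per_level` are hypothetical and are not taken from the paper; the actual SDI-tree construction may differ.

```python
import numpy as np

def pca_transform(X):
    """Rotate X onto its principal axes, ordered by non-increasing variance.

    Returns the transformed data and the per-dimension variances.
    """
    Xc = X - X.mean(axis=0)                    # center the data
    cov = np.cov(Xc, rowvar=False)             # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder to non-increasing variance
    return Xc @ eigvecs[:, order], eigvals[order]

def dims_per_level(variances, p):
    """Assumed reading of p: level k keeps enough leading dimensions to
    cover a fraction min(k * p, 1) of the total variance."""
    frac = np.cumsum(variances) / np.sum(variances)
    n = len(variances)
    dims = []
    k = 1
    while True:
        target = min(k * p, 1.0)
        # first index whose cumulative fraction reaches the target
        # (small tolerance guards against floating-point round-off)
        d = min(int(np.searchsorted(frac, target - 1e-12)) + 1, n)
        if not dims or d > dims[-1]:           # skip levels that add no dimension
            dims.append(d)
        if target >= 1.0:
            break
        k += 1
    return dims

# Example: four dimensions with variances 4, 3, 2, 1 and p = 0.4
levels = dims_per_level(np.array([4.0, 3.0, 2.0, 1.0]), 0.4)
print(levels)  # [1, 3, 4]
```

Under this reading, a smaller p yields more index levels, each adding fewer dimensions, which is consistent with the abstract's remark that the best p has to be found experimentally.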