Subspace clustering for indexing high dimensional data: a main memory index based on local reductions and individual multi-representations

  • Authors:
  • Stephan Günnemann;Hardy Kremer;Dominik Lenhard;Thomas Seidl

  • Affiliations:
  • RWTH Aachen University, Germany;RWTH Aachen University, Germany;RWTH Aachen University, Germany;RWTH Aachen University, Germany

  • Venue:
  • Proceedings of the 14th International Conference on Extending Database Technology
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

Fast similarity search in high dimensional feature spaces is crucial in today's applications. Since the performance of traditional index structures degrades with increasing dimensionality, concepts were developed to cope with this curse of dimensionality. Most of the existing concepts exploit global correlations between dimensions to reduce the dimensionality of the feature space. In high dimensional data, however, correlations are often locally constrained to a subset of the data and every object can participate in several of these correlations. Accordingly, discarding the same set of dimensions for each object based on global correlations and ignoring the different correlations of single objects leads to significant loss of information. These aspects are relevant due to the direct correspondence between the degree of information preserved and the achievable query performance. We introduce a novel main memory index structure with increased information content for each single object compared to a global approach. This is achieved by using individual dimensions for each data object by applying the method of subspace clustering. The structure of our index is based on a multi-representation of objects reflecting their multiple correlations; that is, besides the general increase of information per object, we provide several individual representations for each single data object. These multiple views correspond to different local reductions per object and enable more effective pruning. In thorough experiments on real and synthetic data, we demonstrate that our novel solution achieves low query times and outperforms existing approaches designed for high dimensional data.