Indexing High-Dimensional Data for Efficient In-Memory Similarity Search

Authors:
Bin Cui;Beng Chin Ooi;Jianwen Su;Kian-Lee Tan
Affiliations:
-;IEEE;IEEE;IEEE Computer Society
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2005

Abstract

In main-memory systems, the L2 cache typically employs cache line sizes of 32-128 bytes. These sizes are small relative to a high-dimensional data point, e.g., one with 32 dimensions. The consequence is that existing techniques that minimize cache misses on low-dimensional data are no longer effective. In this paper, we present a novel index structure, called the Δ-tree, to speed up high-dimensional queries in a main-memory environment. The Δ-tree is a multilevel structure in which each level represents the data space at a different dimensionality: the number of dimensions increases toward the leaf level. The remaining dimensions are obtained using Principal Component Analysis. Each level of the tree serves to prune the search space more efficiently, as the lower dimensionality reduces the cost of distance computation and better exploits the small cache line size. Additionally, the top-down clustering scheme can capture the features of the data set and hence reduce the search space. We also propose an extension, called the Δ+-tree, that globally clusters the data space and then partitions the clusters into small regions. The Δ+-tree can further reduce the computational cost and cache misses. We conducted extensive experiments to evaluate the proposed structures against existing techniques on different kinds of data sets. Our results show that the Δ+-tree is superior in most cases.
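
The pruning idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' Δ-tree: it has no tree, no clustering, and no cache-aware layout. It only shows why comparing a few leading dimensions first can discard candidates early. It assumes the points have already been rotated into an orthonormal basis such as PCA output (the rotation itself is not performed here), so the squared distance over the first m coordinates is a lower bound on the full squared distance. The names `knn_multilevel`, `partial_sq_dist`, `make_decaying_data`, and the `levels` parameter are hypothetical and chosen for illustration.

```python
import heapq
import random


def partial_sq_dist(a, b, start, stop):
    """Squared Euclidean distance restricted to dimensions [start, stop)."""
    return sum((a[i] - b[i]) ** 2 for i in range(start, stop))


def knn_multilevel(points, query, k, levels):
    """Exact kNN by linear scan with level-by-level pruning.

    `levels` is an increasing list of dimensionalities, e.g. [4, 8, 16],
    whose last entry equals the full dimensionality. Assumption: the
    coordinates are already in an orthonormal (e.g. PCA) basis, so each
    partial distance is a lower bound on the full distance.

    Returns (result, dims_touched), where result is a sorted list of
    (squared_distance, index) pairs and dims_touched counts how many
    coordinate differences were actually evaluated.
    """
    heap = []  # max-heap emulated with negated distances: (-sq_dist, index)
    dims_touched = 0
    for idx, p in enumerate(points):
        # Current k-th best squared distance, or infinity until k are found.
        bound = -heap[0][0] if len(heap) == k else float("inf")
        acc, prev, pruned = 0.0, 0, False
        for m in levels:
            acc += partial_sq_dist(p, query, prev, m)
            dims_touched += m - prev
            prev = m
            if acc > bound:  # lower bound already exceeds the k-th best
                pruned = True
                break
        if pruned:
            continue
        if len(heap) < k:
            heapq.heappush(heap, (-acc, idx))
        else:
            heapq.heappushpop(heap, (-acc, idx))
    return sorted((-neg, i) for neg, i in heap), dims_touched


def make_decaying_data(n, dim, rng):
    """Synthetic points whose variance shrinks with the dimension index,
    loosely mimicking PCA-ordered coordinates (an illustrative assumption)."""
    return [[rng.gauss(0.0, 1.0 / (j + 1)) for j in range(dim)] for _ in range(n)]
```

Calling `knn_multilevel(points, query, 5, [4, 8, 16])` on 16-dimensional data returns the same neighbors as a full scan, and `dims_touched` can be compared against `len(points) * 16` to see how much work the early levels saved on a given data set.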