CM-tree: A dynamic clustered index for similarity search in metric databases

Authors:
Lior Aronovich; Israel Spiegler
Affiliations:
Information Systems Department, Tel Aviv University, P.O.B. 39010, Ramat Aviv, Tel Aviv 69978, Israel (both authors)
Venue:
Data & Knowledge Engineering
Year:
2007

Abstract

Repositories of unstructured data types, such as free text, images, audio and video, have recently emerged in various fields. A general approach to searching such data is similarity search, in which the search is for similar objects and similarity is modeled by a metric distance function. In this article we propose a new dynamic, paged and balanced access method for similarity search in metric data sets, named CM-tree (Clustered Metric tree). It fully supports dynamic insertions and deletions, both of single objects and in bulk. Unlike other methods, it is specifically designed to achieve a structure of tight, low-overlap clusters through its primary construction algorithms (rather than through post-processing), yielding significantly improved performance. Several new methods are introduced to achieve this: a strategy for selecting the representative objects of nodes, a clustering-based node split algorithm together with criteria for triggering a node split, and an improved sub-tree pruning method used during search. To facilitate these methods, the pairwise distances between the objects of a node are maintained within each node. Results from an extensive experimental study show that the CM-tree outperforms the M-tree and the Slim-tree, improving search performance by up to 312% for I/O costs and 303% for CPU costs.
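To make the ideas of metric similarity search and distance-based pruning concrete, here is a minimal Python sketch. It is not the CM-tree's algorithm: it only illustrates the generic triangle-inequality pruning that metric access methods rely on, applied to a flat list of nodes that each cache their pairwise distances. The names `Node`, `range_search` and `euclidean`, and the rule used to pick a node's representative object, are assumptions made for illustration and are not taken from the article.

```python
import math
from dataclasses import dataclass, field


def euclidean(a, b):
    """Example metric; any distance obeying the triangle inequality would do."""
    return math.dist(a, b)


@dataclass
class Node:
    """Hypothetical node: a group of objects plus their cached pairwise distances."""
    objects: list
    pairwise: list = field(init=False)  # pairwise[i][j] = d(objects[i], objects[j])
    center: int = field(init=False)     # index of the representative object
    radius: float = field(init=False)   # covering radius around the representative

    def __post_init__(self):
        n = len(self.objects)
        self.pairwise = [
            [euclidean(self.objects[i], self.objects[j]) for j in range(n)]
            for i in range(n)
        ]
        # Assumption: the representative is the object whose farthest
        # neighbour in the node is closest (this keeps the ball tight).
        self.center = min(range(n), key=lambda i: max(self.pairwise[i]))
        self.radius = max(self.pairwise[self.center])


def range_search(nodes, query, r, dist=euclidean):
    """Return (objects within distance r of query, distance computations used)."""
    results, computations = [], 0
    for node in nodes:
        d_qc = dist(query, node.objects[node.center])
        computations += 1
        # Node-level pruning: every object o satisfies d(c, o) <= radius,
        # so d(q, o) >= d(q, c) - radius. Skip the node if that exceeds r.
        if d_qc > node.radius + r:
            continue
        for i, obj in enumerate(node.objects):
            if i == node.center:
                if d_qc <= r:
                    results.append(obj)
                continue
            # Object-level pruning with the cached distance d(c, o):
            # |d(q, c) - d(c, o)| <= d(q, o), so no computation is needed
            # when the lower bound already exceeds r.
            if abs(d_qc - node.pairwise[node.center][i]) > r:
                continue
            computations += 1
            if dist(query, obj) <= r:
                results.append(obj)
    return results, computations


# Two well-separated clusters of 2-D points, one per node.
nodes = [
    Node([(0, 0), (1, 0), (0, 1), (1, 1)]),
    Node([(10, 10), (11, 10), (10, 11), (11, 11)]),
]
hits, cost = range_search(nodes, query=(0.9, 0.9), r=0.5)
print(hits, cost)
```

In this toy data set the second node is discarded after a single distance computation, so the search evaluates fewer distances than a linear scan over all eight objects. Tighter, less overlapping clusters make this kind of pruning succeed more often, which is the intuition behind the clustering goal stated in the abstract.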