Efficient Similarity Search by Reducing I/O with Compressed Sketches

Authors:
Arnoldo José Muller-Molina; Takeshi Shinohara
Venue:
SISAP '09 Proceedings of the 2009 Second International Workshop on Similarity Search and Applications
Year:
2009

Abstract

Sketches are compact bit-string representations of objects. Objects that have the same sketch are stored in the same database bucket. The Hamming distance between two sketches gives an estimate of the similarity of the corresponding objects: objects that are close to each other are expected to have sketches with a small Hamming distance. This estimate is used to schedule the order in which buckets are visited at search time. Recent research has shown that sketches can effectively approximate $L_1$ and $L_2$ distances in high-dimensional settings. A remaining task is to provide a general sketch for arbitrary metric spaces. This paper presents a novel sketch based on generalized hyperplane partitioning that can be employed in arbitrary metric spaces. The core of the sketch is a heuristic that tries to generate balanced partitions. The indexing method AESA stores all the distances among database objects, which allows it to answer queries with a small number of distance computations. Experimental evaluations show that, given a good early termination strategy, our algorithm performs up to one order of magnitude fewer distance operations than AESA in string spaces. Comparisons against other methods show greater gains. Furthermore, we experimentally demonstrate that it is possible to reduce the physical size of the sketches by a factor of ten with different run-length encodings.
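
The ideas in the abstract can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the pivot pairs are supplied by hand (the paper's heuristic for choosing balanced partitions is not described here), every function name is hypothetical, and `run_lengths` is a generic run-length scheme rather than any of the specific encodings evaluated in the paper. Each bit of a sketch records which pivot of a pair the object is closer to (generalized hyperplane partitioning), objects with equal sketches share a bucket, and buckets are visited in order of increasing Hamming distance to the query's sketch.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Distance = Callable[[str, str], int]
PivotPairs = Sequence[Tuple[str, str]]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, a metric on strings (example metric space)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def sketch(obj: str, pivot_pairs: PivotPairs, dist: Distance) -> int:
    """One bit per pivot pair: 1 if obj is strictly closer to the second pivot."""
    bits = 0
    for i, (p0, p1) in enumerate(pivot_pairs):
        if dist(obj, p1) < dist(obj, p0):
            bits |= 1 << i
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two sketches differ."""
    return bin(a ^ b).count("1")


def build_buckets(objects: Sequence[str], pivot_pairs: PivotPairs,
                  dist: Distance) -> Dict[int, List[str]]:
    """Group objects by sketch; each distinct sketch is one bucket."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for obj in objects:
        buckets[sketch(obj, pivot_pairs, dist)].append(obj)
    return dict(buckets)


def bucket_visit_order(query: str, buckets: Dict[int, List[str]],
                       pivot_pairs: PivotPairs, dist: Distance) -> List[int]:
    """Bucket sketches sorted by Hamming distance to the query's sketch."""
    q = sketch(query, pivot_pairs, dist)
    return sorted(buckets, key=lambda s: (hamming(q, s), s))


def run_lengths(bits: int, width: int) -> List[int]:
    """Lengths of alternating runs, read from the least significant bit.

    The first run counts 0s and may be empty (length 0).
    """
    runs: List[int] = []
    current, count = 0, 0
    for i in range(width):
        b = (bits >> i) & 1
        if b == current:
            count += 1
        else:
            runs.append(count)
            current, count = b, 1
    runs.append(count)
    return runs
```

With hypothetical pivot pairs such as `[("aaaa", "bbbb"), ("abab", "baba")]`, calling `build_buckets` on a list of strings and then `bucket_visit_order` for a query yields the bucket schedule; a search would scan buckets in that order and stop early once its termination criterion is met. Ties between the two pivots are resolved toward bit 0 here, which is an arbitrary choice for the illustration.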