Learning "forgiving" hash functions: algorithms and large scale tests

Authors:
Shumeet Baluja;Michele Covell
Affiliations:
Google Inc., Mountain View, CA;Google Inc., Mountain View, CA
Venue:
IJCAI'07 Proceedings of the 20th International Joint Conference on Artificial Intelligence
Year:
2007



Abstract

The problem of efficiently finding similar items in a large corpus of high-dimensional data points arises in many real-world tasks, such as music, image, and video retrieval. Beyond the scaling difficulties that arise with lookups in large data sets, the complexity in these domains is exacerbated by an imprecise definition of similarity. In this paper, we describe a method to learn a similarity function from only weakly labeled positive examples. Once learned, this similarity function is used as the basis of a hash function to severely constrain the number of points considered for each lookup. In tests on a large real-world audio dataset, only a tiny fraction of the points (∼0.27%) is ever considered for each lookup. To increase efficiency, no comparisons in the original high-dimensional space of points are required. The performance far surpasses, in terms of both efficiency and accuracy, a state-of-the-art Locality-Sensitive Hashing-based technique for the same problem and data set.
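The lookup scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the learned hash function, so the code swaps in random-projection sign bits purely as a placeholder. The names `make_hash` and `HashIndex`, the number of tables, and the bit width are all hypothetical. What the sketch does show is the retrieval pattern: each point is placed into one bucket per table, a query inspects only its own buckets, and candidates are ranked by how many tables agree, with no distance computed in the original high-dimensional space.

```python
import random
from collections import Counter, defaultdict


def make_hash(dim, bits, rng):
    """Placeholder hash: sign bits of random projections.

    ASSUMPTION: this stands in for the learned similarity-based hash,
    which the abstract does not describe in detail.
    """
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]

    def h(x):
        # One bit per hyperplane: 1 if the point lies on its non-negative side.
        return tuple(int(sum(w * v for w, v in zip(p, x)) >= 0) for p in planes)

    return h


class HashIndex:
    """Hypothetical multi-table index; one hash function per table."""

    def __init__(self, hash_fns):
        self.hash_fns = hash_fns
        self.tables = [defaultdict(list) for _ in hash_fns]

    def add(self, item_id, x):
        # Store only the item id in the bucket selected by each hash function.
        for h, table in zip(self.hash_fns, self.tables):
            table[h(x)].append(item_id)

    def lookup(self, q):
        # Candidates are the ids sharing a bucket with the query in any table.
        # Ranking uses vote counts only; no original-space distance is computed.
        votes = Counter()
        for h, table in zip(self.hash_fns, self.tables):
            votes.update(table.get(h(q), ()))
        return votes.most_common()
```

In use, one would build several hash functions, call `add` for every corpus point, and call `lookup` with a query vector. The result is a list of `(item_id, votes)` pairs that covers only the points sharing at least one bucket with the query, which is what keeps the fraction of examined points small.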