Indexing and searching 100M images with map-reduce

Authors:
Diana Moise;Denis Shestakov;Gylfi Gudmundsson;Laurent Amsaleg
Affiliations:
INRIA, Rennes, France;INRIA, Rennes, France;INRIA, Rennes, France;IRISA - CNRS, Rennes, France
Venue:
Proceedings of the 3rd ACM International Conference on Multimedia Retrieval
Year:
2013

Abstract

Most researchers working on high-dimensional indexing agree on three trends: (i) the multimedia collections to index now reach millions, if not billions, of items; (ii) the computers we use every day now come with multiple cores; and (iii) hardware is becoming more available, thanks to easier access to Grids and/or Clouds. This paper shows how the Map-Reduce paradigm can be applied to indexing algorithms and demonstrates that great scalability can be achieved using Hadoop, a popular Map-Reduce-based framework. Dramatic performance improvements are not, however, guaranteed a priori: such frameworks are rigid, they severely constrain the possible access patterns to data, and RAM, a scarce resource, has to be shared. Furthermore, algorithms require major redesign and may have to settle for sub-optimal behavior. The benefits, however, are many: simplicity for programmers, automatic distribution, fault tolerance, failure detection with automatic re-runs and, last but not least, scalability. We share our experience of adapting a clustering-based high-dimensional indexing algorithm to the Map-Reduce model and of testing it at large scale with Hadoop, indexing 30 billion SIFT descriptors. We foresee that the lessons drawn from our work could reduce the time, effort and energy invested by other researchers and practitioners working in similar directions.
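
To make the idea of cluster-based indexing under Map-Reduce concrete, the following is a minimal single-process sketch, not the authors' implementation and not Hadoop code. It assumes that a small set of cluster representatives is already available to every mapper, that the map step assigns each descriptor to its nearest representative, and that the reduce step groups descriptors by cluster. All names (`map_phase`, `reduce_phase`, `build_index`) and the toy 2-dimensional vectors are hypothetical; real SIFT descriptors have 128 dimensions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple

Vector = Sequence[float]
Record = Tuple[str, Vector]  # (image id, descriptor); hypothetical layout


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_cluster(descriptor: Vector, representatives: List[Vector]) -> int:
    """Return the position of the representative closest to the descriptor."""
    return min(
        range(len(representatives)),
        key=lambda i: squared_distance(descriptor, representatives[i]),
    )


def map_phase(
    records: Iterable[Record], representatives: List[Vector]
) -> Iterator[Tuple[int, Record]]:
    """Map step: emit (cluster id, record) for every input descriptor."""
    for image_id, descriptor in records:
        yield nearest_cluster(descriptor, representatives), (image_id, descriptor)


def reduce_phase(pairs: Iterable[Tuple[int, Record]]) -> Dict[int, List[Record]]:
    """Reduce step: gather all records that share a cluster id."""
    index: Dict[int, List[Record]] = defaultdict(list)
    for cluster_id, record in pairs:
        index[cluster_id].append(record)
    return dict(index)


def build_index(
    records: Iterable[Record], representatives: List[Vector]
) -> Dict[int, List[Record]]:
    """Run the map step and then the reduce step in a single process."""
    return reduce_phase(map_phase(records, representatives))
```

In a real framework the grouping by key is done by the shuffle between the two phases; here `reduce_phase` plays both roles. A call such as `build_index([("img1", (1.0, 0.5))], [(0.0, 0.0), (10.0, 10.0)])` yields a dictionary that maps each cluster id to the descriptors assigned to it.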