Finding near neighbors through cluster pruning

Authors:
Flavio Chierichetti;Alessandro Panconesi;Prabhakar Raghavan;Mauro Sozio;Alessandro Tiberi;Eli Upfal
Affiliations:
University of Rome;University of Rome;Yahoo! Research;University of Rome;University of Rome;Brown University
Venue:
Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Year:
2007

Citing 19
Cited 10

Algorithms in combinatorial geometry

Algorithms in combinatorial geometry
Recent trends in hierarchic document clustering: a critical review

Information Processing and Management: an International Journal
Approximate closest-point queries in high dimensions

Information Processing Letters
Distance-based indexing for high-dimensional metric spaces

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
The SR-tree: an index structure for high-dimensional nearest neighbor queries

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
Optimization of inverted vector searches

SIGIR '85 Proceedings of the 8th annual international ACM SIGIR conference on Research and development in information retrieval
Nearest neighbor queries in metric spaces

STOC '97 Proceedings of the twenty-ninth annual ACM symposium on Theory of computing
Approximate nearest neighbors: towards removing the curse of dimensionality

STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Density-based indexing for approximate nearest-neighbor queries

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
An optimal algorithm for approximate nearest neighbor searching

SODA '94 Proceedings of the fifth annual ACM-SIAM symposium on Discrete algorithms
On the geometry of similarity search: dimensionality curse and concentration of measure

Information Processing Letters
Ubiquitous B-Tree

ACM Computing Surveys (CSUR)
Efficient Search for Approximate Nearest Neighbor in High Dimensional Spaces

SIAM Journal on Computing
R-trees: a dynamic index structure for spatial searching

SIGMOD '84 Proceedings of the 1984 ACM SIGMOD international conference on Management of data
Chameleon: Hierarchical Clustering Using Dynamic Modeling

Computer
Contrast Plots and P-Sphere Trees: Space vs. Time in Nearest Neighbour Searches

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Near Neighbor Search in Large Metric Spaces

VLDB '95 Proceedings of the 21th International Conference on Very Large Data Bases
The X-tree: An Index Structure for High-Dimensional Data

VLDB '96 Proceedings of the 22th International Conference on Very Large Data Bases
Efficient similarity search and classification via rank aggregation

Proceedings of the 2003 ACM SIGMOD international conference on Management of data

Dynamic user-defined similarity searching in semi-structured text retrieval

Proceedings of the 3rd international conference on Scalable information systems
Videntifier™ forensic: a new law enforcement service for automatic identification of illegal video material

MiFor '09 Proceedings of the First ACM workshop on Multimedia in forensics
A flexible framework to ease nearest neighbor search in multidimensional data spaces

Data & Knowledge Engineering
Framework for evaluating clustering algorithms in duplicate detection

Proceedings of the VLDB Endowment
Tuning the capacity of search engines: Load-driven routing and incremental caching to reduce and balance the load

ACM Transactions on Information Systems (TOIS)
Mining Query Logs: Turning Search Usage Data into Knowledge

Foundations and Trends in Information Retrieval
A large-scale performance study of cluster-based high-dimensional indexing

Proceedings of the international workshop on Very-large-scale multimedia corpus, mining and retrieval
ATLAS: a probabilistic algorithm for high dimensional similarity search

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Impact of storage technology on the efficiency of cluster-based high-dimensional index creation

DASFAA'12 Proceedings of the 17th international conference on Database Systems for Advanced Applications
Indexing and searching 100M images with map-reduce

Proceedings of the 3rd ACM conference on International conference on multimedia retrieval

Quantified Score

Hi-index	0.00

Visualization

Abstract

Finding near(est) neighbors is a classic, difficult problem in data management and retrieval, with applications in text and image search,in finding similar objects and matching patterns. Here we study cluster pruning, an extremely simple randomized technique. During preprocessing we randomly choose a subset of data points to be leaders the remaining data points are partitioned by which leader is the closest. For query processing, we find the leader(s) closest to the query point. We then seek the nearest neighbors for the query point among only the points in the clusters of the closest leader(s). Recursion may be used in both preprocessing and in search. Such schemes seek approximate nearest neighbors that are "almost as good" as the nearest neighbors. How good are these approximations and how much do they save in computation. Our contributions are: (1) we quantify metrics that allow us to study the tradeoff between processing and the quality of the approximate nearest neighbors; (2) we give rigorous theoretical analysis of our schemes, under natural generative processes (generalizing Gaussian mixtures) for the data points; (3) experiments on both synthetic data from such generative processes, as well as on from a document corpus, confirming that we save orders of magnitude in query processing cost at modest compromises in the quality of retrieved points. In particular, we show that p-spheres, a state-of-the-art solution, is outperformed by our simple scheme whether the data points are stored in main or in external memo.