Finding near(est) neighbors is a classic, difficult problem in data management and retrieval, with applications in text and image search, in finding similar objects, and in matching patterns. Here we study cluster pruning, an extremely simple randomized technique. During preprocessing, we randomly choose a subset of the data points to be leaders; the remaining data points are partitioned according to which leader is closest. For query processing, we find the leader(s) closest to the query point. We then seek the nearest neighbors of the query point among only the points in the clusters of the closest leader(s). Recursion may be used in both preprocessing and search. Such schemes seek approximate nearest neighbors that are "almost as good" as the true nearest neighbors. How good are these approximations, and how much computation do they save? Our contributions are: (1) we quantify metrics that allow us to study the tradeoff between processing cost and the quality of the approximate nearest neighbors; (2) we give a rigorous theoretical analysis of our schemes under natural generative processes (generalizing Gaussian mixtures) for the data points; (3) we report experiments on synthetic data from such generative processes, as well as on data from a document corpus, confirming that we save orders of magnitude in query processing cost at modest compromises in the quality of the retrieved points. In particular, we show that p-spheres, a state-of-the-art solution, is outperformed by our simple scheme whether the data points are stored in main or in external memory.
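The preprocessing and query steps described above can be illustrated with a minimal, non-recursive sketch. It is not the authors' implementation: the function names (`build_index`, `query`), the parameters `num_leaders` and `num_probe`, and the use of Euclidean distance are assumptions made for illustration only.

```python
import math
import random


def dist(a, b):
    """Euclidean distance (an assumed metric; the abstract does not fix one)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_index(points, num_leaders, rng):
    """Preprocessing: pick random leaders, then assign each remaining
    point to the cluster of its closest leader.

    Returns a dict mapping leader index -> list of member indices
    (each leader is a member of its own cluster).
    """
    leader_ids = rng.sample(range(len(points)), num_leaders)
    leader_set = set(leader_ids)
    clusters = {lid: [lid] for lid in leader_ids}
    for i, p in enumerate(points):
        if i in leader_set:
            continue
        nearest = min(leader_ids, key=lambda lid: dist(p, points[lid]))
        clusters[nearest].append(i)
    return clusters


def query(points, clusters, q, k=1, num_probe=1):
    """Query: find the num_probe leaders closest to q, then rank only
    the points in those leaders' clusters and return the k closest."""
    leaders = sorted(clusters, key=lambda lid: dist(q, points[lid]))[:num_probe]
    candidates = [i for lid in leaders for i in clusters[lid]]
    candidates.sort(key=lambda i: dist(q, points[i]))
    return candidates[:k]


if __name__ == "__main__":
    rng = random.Random(0)
    pts = [(rng.random(), rng.random()) for _ in range(1000)]
    # Hypothetical setting: roughly sqrt(n) leaders.
    index = build_index(pts, num_leaders=32, rng=rng)
    print(query(pts, index, (0.5, 0.5), k=5, num_probe=2))
```

Raising `num_probe` examines more clusters per query, which trades extra distance computations for better approximation quality; probing every leader reduces to an exact brute-force search.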