Disorder inequality: a combinatorial approach to nearest neighbor search

Authors:
Navin Goyal;Yury Lifshits;Hinrich Schütze
Affiliations:
College of Computing Georgia Tech, Atlanta, GA;California Institute of Technology Pasadena, CA;University of Stuttgart, Stuttgart, Germany
Venue:
WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining
Year:
2008

Abstract

We say that an algorithm for nearest neighbor search is combinatorial if only direct comparisons between two pairwise similarity values are allowed. Combinatorial algorithms for nearest neighbor search have two important advantages: (1) they do not map similarity values to artificial distance values and do not use the triangle inequality for the latter, and (2) they work for arbitrarily complicated data representations and similarity functions. In this paper we introduce a special property of the similarity function on a set S that leads to efficient combinatorial algorithms for S. The disorder constant D(S) of a set S is defined to ensure the following inequality: if x is the a'th most similar object to z and y is the b'th most similar object to z, then x is among the D(S)·(a + b) most similar objects to y. Assuming that disorder is small, we present the first two known combinatorial algorithms for nearest neighbors whose query time has logarithmic dependence on the size of S. The first one, called Ranwalk, is a randomized zero-error algorithm that always returns the exact nearest neighbor. It uses space quadratic in the input size in preprocessing, but is very efficient in query processing. The second algorithm, called Arwalk, uses near-linear space. It uses random choices in preprocessing, but the query processing is essentially deterministic. For an arbitrary query q, there is only a small probability that the chosen data structure does not support q. Finally, we show that for the Reuters corpus average disorder is indeed quite small and that Ranwalk efficiently computes the nearest neighbor in most cases.
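
The disorder inequality in the abstract can be illustrated with a small brute-force sketch. This is not the paper's Ranwalk or Arwalk algorithm. It only computes the smallest constant D for which rank_y(x) <= D·(rank_z(x) + rank_z(y)) holds over all triples of a finite set. The function names `rank_table` and `disorder_constant`, the 1-based ranking that includes the object itself, and the tie-breaking by index are all assumptions made for illustration; the abstract does not fix these conventions.

```python
from itertools import product


def rank_table(points, sim):
    """ranks[z][x] = 1-based position of x when the set is sorted by
    decreasing similarity to z (assumed convention: z itself is included,
    ties are broken by index)."""
    n = len(points)
    ranks = []
    for z in range(n):
        # Only the ordering of similarity values matters here,
        # which matches the "combinatorial" setting of the abstract.
        order = sorted(range(n), key=lambda x: (-sim(points[z], points[x]), x))
        row = [0] * n
        for pos, x in enumerate(order, start=1):
            row[x] = pos
        ranks.append(row)
    return ranks


def disorder_constant(points, sim):
    """Smallest D with ranks[y][x] <= D * (ranks[z][x] + ranks[z][y])
    for every triple (x, y, z). Cubic time, so only for tiny sets."""
    ranks = rank_table(points, sim)
    n = len(points)
    return max(
        ranks[y][x] / (ranks[z][x] + ranks[z][y])
        for x, y, z in product(range(n), repeat=3)
    )


if __name__ == "__main__":
    # Hypothetical toy data: four points on a line, similarity = negative distance.
    toy = [0, 1, 2, 3]
    print(disorder_constant(toy, lambda a, b: -abs(a - b)))
```

Because the computation enumerates all triples, it is useful only as a check of the definition on small samples, not as a way to measure disorder on a corpus of realistic size.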