We say that an algorithm for nearest neighbor search is combinatorial if only direct comparisons between two pairwise similarity values are allowed. Combinatorial algorithms for nearest neighbor search have two important advantages: (1) they do not map similarity values to artificial distance values and do not use the triangle inequality for the latter, and (2) they work for arbitrarily complicated data representations and similarity functions. In this paper we introduce a special property of the similarity function on a set S that leads to efficient combinatorial algorithms for S. The disorder constant D(S) of a set S is defined to ensure the following inequality: if x is the a'th most similar object to z and y is the b'th most similar object to z, then x is among the D(S)(a + b) most similar objects to y. Assuming that disorder is small, we present the first two known combinatorial algorithms for nearest neighbors whose query time has logarithmic dependence on the size of S. The first one, called Ranwalk, is a randomized zero-error algorithm that always returns the exact nearest neighbor. It uses space quadratic in the input size in preprocessing, but is very efficient in query processing. The second algorithm, called Arwalk, uses near-linear space. It uses random choices in preprocessing, but the query processing is essentially deterministic. For an arbitrary query q, there is only a small probability that the chosen data structure does not support q. Finally, we show that for the Reuters corpus average disorder is indeed quite small and that Ranwalk efficiently computes the nearest neighbor in most cases.
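To make the disorder inequality concrete, the following minimal Python sketch estimates the smallest constant D for which the inequality holds on a small finite set, by brute force over all triples (x, y, z). It is an illustration only, not the paper's Ranwalk or Arwalk algorithm. The function names `rank_table` and `disorder_constant` are hypothetical, and two conventions are assumptions not stated in the abstract: ranks are 1-based and count the reference object itself, and ties are broken by index. The sketch only ever compares similarity values with each other, in the spirit of the combinatorial setting.

```python
from itertools import permutations


def rank_table(items, sim):
    """Return ranks where ranks[i][j] is the 1-based position of items[j]
    when all items are sorted by decreasing similarity to items[i].

    Assumption for this sketch: the reference object itself is ranked too,
    and ties are broken by index (Python's sort is stable, also with
    reverse=True).
    """
    n = len(items)
    ranks = []
    for i in range(n):
        order = sorted(range(n), key=lambda j: sim(items[i], items[j]), reverse=True)
        row = [0] * n
        for pos, j in enumerate(order, start=1):
            row[j] = pos
        ranks.append(row)
    return ranks


def disorder_constant(items, sim):
    """Smallest D such that rank_y(x) <= D * (rank_z(x) + rank_z(y))
    for all triples of distinct objects x, y, z (brute force, O(n^3))."""
    ranks = rank_table(items, sim)
    best = 0.0
    for x, y, z in permutations(range(len(items)), 3):
        a = ranks[z][x]  # x is the a'th most similar object to z
        b = ranks[z][y]  # y is the b'th most similar object to z
        best = max(best, ranks[y][x] / (a + b))
    return best


if __name__ == "__main__":
    # Toy data: points on a line, similarity = negative absolute difference.
    points = [0, 1, 2, 3]
    similarity = lambda p, q: -abs(p - q)
    print(disorder_constant(points, similarity))
```

The cubic brute force is only practical for small samples; on a large collection one would presumably estimate disorder on sampled triples, which is closer in spirit to the "average disorder" the abstract reports for the Reuters corpus.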