Combinatorial Framework for Similarity Search

Authors: Yury Lifshits
Venue: SISAP '09 Proceedings of the 2009 Second International Workshop on Similarity Search and Applications
Year: 2009


Abstract

We present an overview of the combinatorial framework for similarity search. An algorithm is combinatorial if only direct comparisons between two pairwise similarity values are allowed. Namely, the input dataset is represented by a comparison oracle that, given any three points X, Y, Z, answers whether Y or Z is closer to X. We assume that the similarity order of the dataset satisfies the four variations of the following disorder inequality: if X is the A'th most similar object to Y and Y is the B'th most similar object to Z, then X is among the D(A+B) most similar objects to Z, where D is a relatively small disorder constant. Combinatorial algorithms for nearest neighbor search have two important advantages: (1) they do not map similarity values to artificial distance values and do not use the triangle inequality for the latter, and (2) they work for arbitrarily complicated data representations and similarity functions. Ranwalk, the first known combinatorial solution for nearest neighbors, is a randomized, exact, zero-error algorithm with query time that is logarithmic in the number of objects, but its preprocessing time is quadratic. Later on, another solution, called combinatorial nets, was discovered. It is a deterministic and exact algorithm with almost linear preprocessing time and space complexity and near-logarithmic search time. Combinatorial nets also have a number of side applications. For near-duplicate detection they lead to the first known deterministic algorithm that requires just near-linear time plus time proportional to the size of the output. For any dataset with small disorder, combinatorial nets can be used to construct a visibility graph: one in which greedy routing deterministically converges to the nearest neighbor of a target in a logarithmic number of steps. The latter result is the first known workaround for Navarro's impossibility of generalizing Delaunay graphs.
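
To make the comparison oracle and the disorder inequality concrete, here is a minimal Python sketch. It is an illustration only, not Ranwalk or combinatorial nets. The helper names (make_oracle, rank_table, disorder_constant, nearest_neighbor) are hypothetical, and ranks are assumed to start at 1 and to exclude the point itself. The sketch checks only the one variation of the inequality stated in the abstract, not all four, and it finds nearest neighbors with a plain linear scan.

```python
from functools import cmp_to_key
from itertools import permutations


def make_oracle(similarity):
    """Wrap any similarity function as a three-point comparison oracle.

    oracle(x, y, z) is True when y is at least as similar to x as z is.
    """
    def oracle(x, y, z):
        return similarity(x, y) >= similarity(x, z)
    return oracle


def rank_table(points, oracle):
    """rank[x][y] = position of y in the similarity order seen from x.

    Assumption: ranks start at 1 and x itself is excluded.
    Only oracle answers are used; similarity values are never read.
    """
    table = {}
    for x in points:
        def closer(y, z, x=x):
            y_first, z_first = oracle(x, y, z), oracle(x, z, y)
            if y_first and z_first:
                return 0  # tie
            return -1 if y_first else 1
        others = sorted((p for p in points if p != x), key=cmp_to_key(closer))
        table[x] = {y: i + 1 for i, y in enumerate(others)}
    return table


def disorder_constant(points, oracle):
    """Smallest D with rank_z(x) <= D * (rank_y(x) + rank_z(y)) on this dataset.

    Covers only the variation stated in the abstract, over distinct x, y, z.
    """
    rank = rank_table(points, oracle)
    return max(
        (rank[z][x] / (rank[y][x] + rank[z][y])
         for x, y, z in permutations(points, 3)),
        default=0.0,
    )


def nearest_neighbor(query, points, oracle):
    """Linear-scan baseline that uses oracle calls only."""
    best = None
    for p in points:
        if best is None or not oracle(query, best, p):
            best = p
    return best


# Usage: points on a line, similarity = negative absolute difference.
oracle = make_oracle(lambda a, b: -abs(a - b))
points = [0, 1, 3]
d = disorder_constant(points, oracle)
nn = nearest_neighbor(5, points, oracle)
```

The linear scan issues one oracle call per candidate. According to the abstract, Ranwalk and combinatorial nets replace it with logarithmic or near-logarithmic query time on datasets with a small disorder constant.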