Combinatorial algorithms for nearest neighbors, near-duplicates and small-world design

Authors: Yury Lifshits; Shengyu Zhang
Affiliations: Yahoo! Research; The Chinese University of Hong Kong
Venue: SODA '09 Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms
Year: 2009


Abstract

We study the so-called combinatorial framework for algorithmic problems in similarity spaces. Namely, the input dataset is represented by a comparison oracle that, given three points x, y, y', answers whether y or y' is closer to x. We assume that the similarity order of the dataset satisfies the four variations of the following disorder inequality: if x is the a'th most similar object to y and y is the b'th most similar object to z, then x is among the D(a + b) most similar objects to z, where D is a relatively small disorder constant. Though the oracle gives much less information than the standard general metric space model, where distance values are given, one can still design very efficient algorithms for various fundamental computational tasks. For nearest neighbor search we present a deterministic and exact algorithm with almost linear time and space complexity of preprocessing, and near-logarithmic time complexity of search. Then, for near-duplicate detection we present the first known deterministic algorithm that requires just near-linear time plus time proportional to the size of the output. Finally, we show that for any dataset satisfying the disorder inequality a visibility graph can be constructed: all outdegrees are near-logarithmic and greedy routing deterministically converges to the nearest neighbor of a target in a logarithmic number of steps. The latter result is the first known work-around for Navarro's impossibility of generalizing Delaunay graphs. The technical contribution of the paper consists of handling "false positives" in data structures and an algorithmic technique called up-aside-down-filter.
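
To make the comparison oracle and the disorder inequality concrete, here is a minimal Python sketch. It is not the paper's algorithm: it is a brute-force illustration in which the oracle is backed by a hidden distance function, and all names (make_oracle, nearest_by_oracle, rank_table, disorder_constant) are hypothetical. It only checks the single form of the inequality stated in the abstract, not all four variations, and it assumes ranks start at 0 for a point relative to itself.

```python
from functools import cmp_to_key
from itertools import permutations


def make_oracle(dist):
    """Wrap a hidden distance function (an assumption for this demo) as a
    comparison oracle: oracle(x, y, y2) is True iff y is strictly closer
    to x than y2 is. Callers never see distance values."""
    def oracle(x, y, y2):
        return dist(x, y) < dist(x, y2)
    return oracle


def nearest_by_oracle(oracle, query, data):
    """Brute-force nearest neighbor using only oracle comparisons
    (linear scan, unlike the near-logarithmic search in the abstract)."""
    best = data[0]
    for p in data[1:]:
        if oracle(query, p, best):
            best = p
    return best


def rank_table(oracle, data):
    """table[i][j] = rank of data[j] in the similarity order of data[i].
    A point has rank 0 with respect to itself when there are no duplicates."""
    n = len(data)
    table = []
    for i in range(n):
        def cmp(a, b, x=data[i]):
            if oracle(x, data[a], data[b]):
                return -1
            if oracle(x, data[b], data[a]):
                return 1
            return 0  # tie: keep the stable input order
        order = sorted(range(n), key=cmp_to_key(cmp))
        ranks = [0] * n
        for pos, j in enumerate(order):
            ranks[j] = pos
        table.append(ranks)
    return table


def disorder_constant(table):
    """Smallest D such that rank_z(x) <= D * (rank_y(x) + rank_z(y))
    holds for all distinct triples (x, y, z) of the dataset."""
    n = len(table)
    worst = 0.0
    for x, y, z in permutations(range(n), 3):
        a = table[y][x]  # x is the a'th most similar object to y
        b = table[z][y]  # y is the b'th most similar object to z
        worst = max(worst, table[z][x] / (a + b))
    return worst


# Usage on a tiny 1-D dataset (chosen to avoid distance ties).
data = [(0.0,), (1.0,), (3.0,), (7.0,), (15.0,)]
oracle = make_oracle(lambda p, q: abs(p[0] - q[0]))
print(nearest_by_oracle(oracle, (6.0,), data))
print(disorder_constant(rank_table(oracle, data)))
```

The sketch spends a quadratic number of oracle calls (times a sorting factor) to build the rank table and a cubic number of steps to measure D, so it is only suitable for inspecting small datasets; the point of the abstract is that a small D permits far more efficient structures.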