Practical Algorithms and Lower Bounds for Similarity Search in Massive Graphs

Authors:
Daniel Fogaras;Balazs Racz
Affiliations:
-;-
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2007

Citing 23
Cited 1

Communication complexity

Communication complexity
Size-estimation framework with applications to transitive closure and reachability

Journal of Computer and System Sciences
Enhanced hypertext categorization using hyperlinks

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Approximate nearest neighbors: towards removing the curse of dimensionality

STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Finding related pages in the World Wide Web

WWW '99 Proceedings of the eighth international conference on World Wide Web
Authoritative sources in a hyperlinked environment

Journal of the ACM (JACM)
Computing on data streams

External memory algorithms
WebBase: a repository of Web pages

Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications networking
On near-uniform URL sampling

Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications networking
Min-wise independent permutations

Journal of Computer and System Sciences - 30th annual ACM symposium on theory of computing
Evaluating strategies for similarity search on the web

Proceedings of the 11th international conference on World Wide Web
Modern Information Retrieval

Modern Information Retrieval
I/O-efficient techniques for computing pagerank

Proceedings of the eleventh international conference on Information and knowledge management
Approximating Aggregate Queries about Web Pages via Random Walks

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
ANF: a fast and scalable tool for data mining in massive graphs

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
SimRank: a measure of structural-context similarity

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Node similarity in networked information spaces

CASCON '01 Proceedings of the 2001 conference of the Centre for Advanced Studies on Collaborative research
On the Resemblance and Containment of Documents

SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
The link prediction problem for social networks

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Sic transit gloria telae: towards an understanding of the web's decay

Proceedings of the 13th international conference on World Wide Web
Scalable collaborative filtering using cluster-based smoothing

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
The complexity of massive data set computations

The complexity of massive data set computations
A scalable randomized method to compute link-based similarity rank on the web graph

EDBT'04 Proceedings of the 2004 international conference on Current Trends in Database Technology

A Parallel Algorithm for Finding Related Pages in the Web by Using Segmented Link Structures

PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining

Abstract

To exploit the similarity information hidden in the hyperlink structure of the Web, this paper introduces algorithms that scale to graphs with billions of vertices on a distributed architecture. The similarity of the multistep neighborhoods of vertices is numerically evaluated by similarity functions including SimRank [1], a recursive refinement of cocitation, and PSimRank, a novel variant with better theoretical characteristics. Our methods are presented in a general framework of Monte Carlo similarity search algorithms that precompute an index database of random fingerprints; at query time, similarities are estimated from the fingerprints. We justify our approximation method by asymptotic worst-case lower bounds: we show that there is a significant gap between exact and approximate approaches, and suggest that exact computation is, in general, infeasible for large-scale inputs. We were the first to evaluate SimRank on real Web data. On the Stanford WebBase [2] graph of 80M pages, the quality of the methods increased significantly in each refinement step until step four.
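
The fingerprint idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes the standard random-surfer interpretation of SimRank, in which the score of two vertices is the expected value of c^τ, where τ is the first step at which two random walks following in-links backward meet. For simplicity the walks here are independent and held in memory. The names `build_fingerprints`, `estimate_simrank`, `in_links`, `num_walks`, `walk_len`, and `decay` are hypothetical and chosen only for this example.

```python
import random


def build_fingerprints(in_links, num_walks, walk_len, seed=0):
    """Precompute, for every vertex, num_walks truncated reverse random walks.

    in_links maps a vertex to the list of vertices that link to it.
    Each stored walk lists the vertices visited at steps 1, 2, ..., walk_len.
    (Assumption: independent walks kept in memory, for illustration only.)
    """
    rng = random.Random(seed)
    index = {}
    for v in in_links:
        walks = []
        for _ in range(num_walks):
            path, cur = [], v
            for _ in range(walk_len):
                preds = in_links.get(cur, [])
                if not preds:  # no in-links: the walk stops early
                    break
                cur = rng.choice(preds)
                path.append(cur)
            walks.append(path)
        index[v] = walks
    return index


def estimate_simrank(index, u, v, decay=0.6):
    """Estimate SimRank(u, v) as the average of decay**tau over walk pairs,
    where tau is the first step at which the two walks sit on the same vertex.
    Pairs that never meet within the stored walk length contribute 0."""
    if u == v:
        return 1.0
    walks_u, walks_v = index[u], index[v]
    total = 0.0
    for path_u, path_v in zip(walks_u, walks_v):
        for step, (a, b) in enumerate(zip(path_u, path_v), start=1):
            if a == b:
                total += decay ** step
                break
    return total / len(walks_u)


# Toy graph: "c" links to both "a" and "b"; "a" and "b" link to "d".
in_links = {"a": ["c"], "b": ["c"], "c": [], "d": ["a", "b"]}
index = build_fingerprints(in_links, num_walks=200, walk_len=4)
print(estimate_simrank(index, "a", "b"))
print(estimate_simrank(index, "a", "c"))
```

In this toy graph every reverse walk from "a" and from "b" reaches "c" at step 1, so the first estimate equals the decay factor up to floating-point rounding, while "c" has no in-links and therefore never meets any other walk. Only the precomputed index is consulted at query time, which is the property the abstract emphasizes.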