Communication complexity
Size-estimation framework with applications to transitive closure and reachability
Journal of Computer and System Sciences
Enhanced hypertext categorization using hyperlinks
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Approximate nearest neighbors: towards removing the curse of dimensionality
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Finding related pages in the World Wide Web
WWW '99 Proceedings of the eighth international conference on World Wide Web
Authoritative sources in a hyperlinked environment
Journal of the ACM (JACM)
External memory algorithms
WebBase: a repository of Web pages
Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
Min-wise independent permutations
Journal of Computer and System Sciences - 30th annual ACM symposium on theory of computing
Evaluating strategies for similarity search on the web
Proceedings of the 11th international conference on World Wide Web
Modern Information Retrieval
I/O-efficient techniques for computing pagerank
Proceedings of the eleventh international conference on Information and knowledge management
Approximating Aggregate Queries about Web Pages via Random Walks
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
ANF: a fast and scalable tool for data mining in massive graphs
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
SimRank: a measure of structural-context similarity
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Node similarity in networked information spaces
CASCON '01 Proceedings of the 2001 conference of the Centre for Advanced Studies on Collaborative research
On the Resemblance and Containment of Documents
SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
The link prediction problem for social networks
CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Sic transit gloria telae: towards an understanding of the web's decay
Proceedings of the 13th international conference on World Wide Web
Scalable collaborative filtering using cluster-based smoothing
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
The complexity of massive data set computations
The complexity of massive data set computations
A scalable randomized method to compute link-based similarity rank on the web graph
EDBT'04 Proceedings of the 2004 international conference on Current Trends in Database Technology
A Parallel Algorithm for Finding Related Pages in the Web by Using Segmented Link Structures
PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Hi-index | 0.00 |
To exploit the similarity information hidden in the hyperlink structure of the Web, this paper introduces algorithms scalable to graphs with billions of vertices on a distributed architecture. The similarity of multistep neighborhoods of vertices are numerically evaluated by similarity functions including SimRank [1], a recursive refinement of cocitation, and PSimRank, a novel variant with better theoretical characteristics. Our methods are presented in a general framework of Monte Carlo similarity search algorithms that precompute an index database of random fingerprints, and at query time, similarities are estimated from the fingerprints. We justify our approximation method by asymptotic worst-case lower bounds: We show that there is a significant gap between exact and approximate approaches, and suggest that the exact computation, in general, is infeasible for large-scale inputs. We were the first to evaluate SimRank on real Web data. On the Stanford WebBase [2] graph of 80M pages the quality of the methods increased significantly in each refinement step until step four.