Scaling link-based similarity search

Authors:
Dániel Fogaras;Balázs Rácz
Affiliations:
Budapest University of Technology and Economics, Budapest, Hungary;Computer and Automation Research Institute of the Hungarian Academy of Sciences, Budapest, Hungary
Venue:
WWW '05 Proceedings of the 14th international conference on World Wide Web
Year:
2005

Citing 22
Cited 38

Randomized algorithms

Randomized algorithms
Size-estimation framework with applications to transitive closure and reachability

Journal of Computer and System Sciences
Enhanced hypertext categorization using hyperlinks

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
Finding related pages in the World Wide Web

WWW '99 Proceedings of the eighth international conference on World Wide Web
Authoritative sources in a hyperlinked environment

Journal of the ACM (JACM)
WebBase: a repository of Web pages

Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications networking
On near-uniform URL sampling

Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications networking
Graph structure in the Web

Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications networking
Min-wise independent permutations

Journal of Computer and System Sciences - 30th annual ACM symposium on theory of computing
Evaluating strategies for similarity search on the web

Proceedings of the 11th international conference on World Wide Web
Modern Information Retrieval

Modern Information Retrieval
I/O-efficient techniques for computing pagerank

Proceedings of the eleventh international conference on Information and knowledge management
Approximating Aggregate Queries about Web Pages via Random Walks

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
ANF: a fast and scalable tool for data mining in massive graphs

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
SimRank: a measure of structural-context similarity

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Node similarity in networked information spaces

CASCON '01 Proceedings of the 2001 conference of the Centre for Advanced Studies on Collaborative research
On the Resemblance and Containment of Documents

SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
The link prediction problem for social networks

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Sic transit gloria telae: towards an understanding of the web's decay

Proceedings of the 13th international conference on World Wide Web
Algorithms for memory hierarchies: advanced lectures

Algorithms for memory hierarchies: advanced lectures
A scalable randomized method to compute link-based similarity rank on the web graph

EDBT'04 Proceedings of the 2004 international conference on Current Trends in Database Technology

Hyperlink analysis on the world wide web

Proceedings of the sixteenth ACM conference on Hypertext and hypermedia
To randomize or not to randomize: space optimal summaries for hyperlink analysis

Proceedings of the 15th international conference on World Wide Web
LinkClus: efficient clustering via heterogeneous semantic links

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Personalized query expansion for the web

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Detecting splogs via temporal dynamics using self-similarity analysis

ACM Transactions on the Web (TWEB)
People search: Searching people sharing similar interests from the Web

Journal of the American Society for Information Science and Technology
Efficient semi-streaming algorithms for local triangle counting in massive graphs

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Imagination: Exploiting Link Analysis for Accurate Image Annotation

Adaptive Multimedia Retrieval: Retrieval, User, and Semantics
Accuracy estimate and optimization techniques for SimRank computation

Proceedings of the VLDB Endowment
Sponsored ad-based similarity: an approach to mining collective advertiser intelligence

Proceedings of the 2nd International Workshop on Data Mining and Audience Intelligence for Advertising
An Adaptive Method for the Efficient Similarity Calculation

DASFAA '09 Proceedings of the 14th International Conference on Database Systems for Advanced Applications
Using Link-Based Content Analysis to Measure Document Similarity Effectively

APWeb/WAIM '09 Proceedings of the Joint International Conferences on Advances in Data and Web Management
Calculating Similarity Efficiently in a Small World

ADMA '09 Proceedings of the 5th International Conference on Advanced Data Mining and Applications
P-Rank: a comprehensive structural similarity measure over information networks

Proceedings of the 18th ACM conference on Information and knowledge management
Accuracy estimate and optimization techniques for SimRank computation

The VLDB Journal — The International Journal on Very Large Data Bases
Fast computation of SimRank for static and dynamic information networks

Proceedings of the 13th International Conference on Extending Database Technology
Web mediators for accessible browsing

ERCIM'06 Proceedings of the 9th conference on User interfaces for all
Exploring the power of heuristics and links in multi-relational data mining

ISMIS'08 Proceedings of the 17th international conference on Foundations of intelligent systems
Parallel SimRank computation on large graphs with iterative aggregation

Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining
Efficient algorithms for large-scale local triangle counting

ACM Transactions on Knowledge Discovery from Data (TKDD)
Adaptive combination of tag and link-based user similarity in flickr

Proceedings of the international conference on Multimedia
Taming computational complexity: efficient and parallel simrank optimizations on undirected graphs

WAIM'10 Proceedings of the 11th international conference on Web-age information management
Link proximity analysis: clustering websites by examining link proximity

ECDL'10 Proceedings of the 14th European conference on Research and advanced technology for digital libraries
A fast two-stage algorithm for computing SimRank and its extensions

WAIM'10 Proceedings of the 2010 international conference on Web-age information management
Axiomatic ranking of network role similarity

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Pairwise similarity calculation of information networks

DaWaK'11 Proceedings of the 13th international conference on Data warehousing and knowledge discovery
ASAP: towards accurate, stable and accelerative penetrating-rank estimation on large graphs

WAIM'11 Proceedings of the 12th international conference on Web-age information management
Finding information nebula over large networks

Proceedings of the 20th ACM international conference on Information and knowledge management
MFCRank: a web ranking algorithm based on correlation of multiple features

CICLing'06 Proceedings of the 7th international conference on Computational Linguistics and Intelligent Text Processing
A space and time efficient algorithm for SimRank computation

World Wide Web
Computational folkloristics

Communications of the ACM
Delta-SimRank computing on MapReduce

Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
On the efficiency of estimating penetrating rank on large graphs

SSDBM'12 Proceedings of the 24th international conference on Scientific and Statistical Database Management
E-rank: A Structural-Based Similarity Measure in Social Networks

WI-IAT '12 Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
Scalable and axiomatic ranking of network role similarity

ACM Transactions on Knowledge Discovery from Data (TKDD) - Casin special issue
Efficient simrank-based similarity join over large graphs

Proceedings of the VLDB Endowment
Assessing single-pair similarity over graphs by aggregating first-meeting probabilities

Information Systems
Structure/attribute computation of similarities between nodes of a RDF graph with application to linked data clustering

Intelligent Data Analysis

Quantified Score

Hi-index	0.02

Abstract

To exploit the similarity information hidden in the hyperlink structure of the web, this paper introduces algorithms that scale to graphs with billions of vertices on a distributed architecture. The similarity of multi-step neighborhoods of vertices is numerically evaluated by similarity functions including SimRank [20], a recursive refinement of cocitation; PSimRank, a novel variant with better theoretical characteristics; and the Jaccard coefficient, extended to multi-step neighborhoods. Our methods are presented in a general framework of Monte Carlo similarity search algorithms that precompute an index database of random fingerprints and, at query time, estimate similarities from the fingerprints. The performance and quality of the methods were tested on the Stanford WebBase [19] graph of 80M pages by comparing our scores to similarities extracted from the ODP directory [26]. Our experimental results suggest that the hyperlink structure of vertices within four to five steps provides more adequate information for similarity search than single-step neighborhoods.
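
The fingerprint idea in the abstract can be illustrated with a small, single-machine sketch. It is not the paper's distributed implementation. It assumes the standard random-walk reading of SimRank: the score of two vertices is the expected value of decay^τ, where τ is the first step at which two walks that follow in-links backwards meet. The function names, the `in_links` dictionary layout, and the default decay of 0.6 are all hypothetical choices made for illustration.

```python
import random


def build_fingerprints(in_links, num_fingerprints, path_length, seed=0):
    """Precompute truncated reverse random walks (hypothetical index layout).

    in_links maps a vertex to the list of vertices that link to it.
    Within one fingerprint and one step, every walk standing on the same
    vertex takes the same randomly chosen in-link, so walks that meet
    stay together afterwards.
    """
    rng = random.Random(seed)
    vertices = set(in_links)
    for sources in in_links.values():
        vertices.update(sources)
    vertices = sorted(vertices)

    index = []
    for _ in range(num_fingerprints):
        paths = {v: [v] for v in vertices}
        for _ in range(path_length):
            # One shared random choice per vertex for this step.
            step = {}
            for w in vertices:
                sources = in_links.get(w, [])
                step[w] = rng.choice(sources) if sources else None
            for v in vertices:
                current = paths[v][-1]
                # A walk that reached a vertex without in-links stops (None).
                paths[v].append(step[current] if current is not None else None)
        index.append(paths)
    return index


def estimate_simrank(index, u, v, decay=0.6):
    """Average decay**t over fingerprints, t being the first meeting step."""
    if u == v:
        return 1.0
    total = 0.0
    for paths in index:
        for t, (a, b) in enumerate(zip(paths[u], paths[v])):
            if a is None or b is None:
                break  # one walk stopped, so the pair can no longer meet
            if a == b:
                total += decay ** t
                break
    return total / len(index)


if __name__ == "__main__":
    # Toy graph: page "c" links to both "a" and "b".
    toy_in_links = {"a": ["c"], "b": ["c"], "c": []}
    toy_index = build_fingerprints(toy_in_links, num_fingerprints=100, path_length=5)
    print(estimate_simrank(toy_index, "a", "b"))
```

In the toy graph both walks are forced onto "c" after one step, so every fingerprint contributes decay^1. Because walks are truncated at `path_length`, meetings that would occur later are missed and the estimate can only undershoot the untruncated expectation; more fingerprints reduce the variance, not this truncation bias.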