Estimating clustering coefficients and size of social networks via random walk

  • Authors:
  • Stephen J. Hardiman;Liran Katzir

  • Affiliations:
  • Capital Fund Management, Paris, France;Microsoft Research, Herzliya, Israel

  • Venue:
  • Proceedings of the 22nd international conference on World Wide Web
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

Online social networks have become a major force in today's society and economy. The largest of today's social networks may have hundreds of millions to more than a billion users. Such networks are too large to be downloaded or stored locally, even if terms of use and privacy policies were to permit doing so. This limitation complicates even simple computational tasks. One such task is computing the clustering coefficient of a network. Another task is to compute the network size (number of registered users) or a subpopulation size. The clustering coefficient, a classic measure of network connectivity, comes in two flavors, global and network average. In this work, we provide efficient algorithms for estimating these measures which (1) assume no prior knowledge about the network; and (2) access the network using only the publicly available interface. More precisely, this work provides three new estimation algorithms (a) the first external access algorithm for estimating the global clustering coefficient; (b) an external access algorithm that improves on the accuracy of previous network average clustering coefficient estimation algorithms; and (c) an improved external access network size estimation algorithm. The main insight offered by this work is that only a relatively small number of public interface calls are required to allow our algorithms to achieve a high accuracy estimation. Our approach is to view a social network as an undirected graph and use the public interface to retrieve a random walk. To estimate the clustering coefficient, the connectivity of each node in the random walk sequence is tested in turn. We show that the error of this estimation drops exponentially in the number of random walk steps. Another insight of this work is the fact that, although the proposed algorithms can be used to estimate the clustering coefficient of any undirected graph, they are particularly efficient on social network-like graphs. To improve the network size prior-art estimation algorithms, we count node collision one step before they actually occur. In our experiments we validate our algorithms on several publicly available social network datasets. Our results validate the theoretical claims and demonstrate the effectiveness of our algorithms.