Counting triangles in data streams

  • Authors:
  • Luciana S. Buriol;Gereon Frahling;Stefano Leonardi;Alberto Marchetti-Spaccamela;Christian Sohler

  • Affiliations:
  • Universidade Federal de Santa Maria, Brazil;University of Paderborn, Germany;Universitá di Roma "La Sapienza", Italy;Universitá di Roma "La Sapienza", Italy;University of Paderborn, Germany

  • Venue:
  • Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
  • Year:
  • 2006

Quantified Score

Hi-index 0.01

Visualization

Abstract

We present two space bounded random sampling algorithms that compute an approximation of the number of triangles in an undirected graph given as a stream of edges. Our first algorithm does not make any assumptions on the order of edges in the stream. It uses space that is inversely related to the ratio between the number of triangles and the number of triples with at least one edge in the induced subgraph, and constant expected update time per edge. Our second algorithm is designed for incidence streams (all edges incident to the same vertex appear consecutively). It uses space that is inversely related to the ratio between the number of triangles and length 2 paths in the graph and expected update time O(log|V|⋅(1+s⋅|V|/|E|)), where s is the space requirement of the algorithm. These results significantly improve over previous work [20, 8]. Since the space complexity depends only on the structure of the input graph and not on the number of nodes, our algorithms scale very well with increasing graph size and so they provide a basic tool to analyze the structure of large graphs. They have many applications, for example, in the discovery of Web communities, the computation of clustering and transitivity coefficient, and discovery of frequent patterns in large graphs.We have implemented both algorithms and evaluated their performance on networks from different application domains. The sizes of the considered graphs varied from about 8,000 nodes and 40,000 edges to 135 million nodes and more than 1 billion edges. For both algorithms we run experiments with parameter s=1,000, 10,000, 100,000, 1,000,000 to evaluate running time and approximation guarantee. Both algorithms appear to be time efficient for these sample sizes. The approximation quality of the first algorithm was varying significantly and even for s=1,000,000 we had more than 10% deviation for more than half of the instances. The second algorithm performed much better and even for s=10,000 we had an average deviation of less than 6% (taken over all but the largest instance for which we could not compute the number of triangles exactly).