Space-efficient sampling from social activity streams

  • Authors:
  • Nesreen K. Ahmed; Jennifer Neville; Ramana Kompella

  • Affiliations:
  • Purdue University; Purdue University; Purdue University

  • Venue:
  • Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
  • Year:
  • 2012

Abstract

To efficiently study the characteristics of network domains and support the development of network systems (e.g., algorithms and protocols that operate on networks), it is often necessary to sample a representative subgraph from a large complex network. Although recent subgraph sampling methods have been shown to work well, they focus on sampling from memory-resident graphs and assume that the sampling algorithm can access the entire graph in order to decide which nodes/edges to select. Many large-scale network datasets (e.g., email, tweets, wall posts), however, are too large and/or too dynamic to be processed in main memory. In this work, we formulate the problem of sampling from large graph streams. We propose a streaming graph sampling algorithm that dynamically maintains a representative sample in a reservoir-based setting. We evaluate the efficacy of our proposed methods empirically using several real-world datasets. Across all datasets, we found that our method produces samples that better preserve the original graph distributions.
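
To make the reservoir-based setting concrete, the following is a minimal sketch of classic uniform reservoir sampling applied to a stream of edges. It is not the algorithm proposed in the paper, whose details the abstract does not give; it only illustrates how a fixed-size sample can be maintained in a single pass without holding the full graph in memory. The function name `reservoir_sample_edges`, its parameters, and the representation of an edge as a `(source, target)` tuple are assumptions made for illustration.

```python
import random
from typing import Hashable, Iterable, List, Optional, Tuple

# Hypothetical edge representation: a (source, target) pair of node ids.
Edge = Tuple[Hashable, Hashable]


def reservoir_sample_edges(
    stream: Iterable[Edge],
    capacity: int,
    seed: Optional[int] = None,
) -> List[Edge]:
    """Keep a uniform random sample of at most `capacity` edges from a stream.

    This is standard reservoir sampling, shown only as an illustration of the
    reservoir setting; it is not the method described in the abstract.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")

    rng = random.Random(seed)
    reservoir: List[Edge] = []

    for seen, edge in enumerate(stream, start=1):
        if len(reservoir) < capacity:
            # Fill phase: the first `capacity` edges are always kept.
            reservoir.append(edge)
        else:
            # Replacement phase: the incoming edge is kept with probability
            # capacity / seen, evicting a uniformly chosen resident edge.
            slot = rng.randrange(seen)
            if slot < capacity:
                reservoir[slot] = edge

    return reservoir
```

Calling `reservoir_sample_edges(edge_stream, capacity=1000)` on any iterable of edge tuples returns at most 1000 edges while using memory proportional to the reservoir size rather than to the stream length. Uniform edge sampling of this kind does not by itself aim to preserve graph structure, which is the gap the abstract's proposed method is meant to address.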