Scalable manipulation of archival web graphs

  • Authors:
  • Yasemin Avcular;Torsten Suel

  • Affiliations:
  • Polytechnic Institute of NYU, Brooklyn, NY, USA;Polytechnic Institute of NYU, Brooklyn, NY, USA

  • Venue:
  • Proceedings of the 9th workshop on Large-scale and distributed informational retrieval
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

In this paper, we study efficient ways to construct, represent and analyze large-scale archival web graphs. We first discuss details of the distributed graph construction algorithm implemented in MapReduce and the design of a space-efficient layered graph representation. While designing this representation, we consider both offline and online algorithms for the graph analysis. The offline algorithms, such as PageRank, can use MapReduce and similar large-scale, distributed frameworks for computation. On the other side, online algorithms can be implemented by tapping into a scalable repository (similar to DEC's Connectivity Server or Scalable Hyperlink Store by Najork), in order to perform the computations. Moreover, we also consider updating the graph representation with the most recent information available and propose an efficient way to perform updates using MapReduce. We survey various storage options and outline essential API calls for the archival web graph specific real-time access repository. Finally, we conclude with a discussion of ideas for interesting archival web graph analysis that can lead us to discover novel patterns for designing state-of-art compression techniques.