Scalable manipulation of archival web graphs

Authors:
Yasemin Avcular;Torsten Suel
Affiliations:
Polytechnic Institute of NYU, Brooklyn, NY, USA;Polytechnic Institute of NYU, Brooklyn, NY, USA
Venue:
Proceedings of the 9th workshop on Large-scale and distributed informational retrieval
Year:
2011

References:

The connectivity server: fast access to linkage information on the Web. WWW7 Proceedings of the seventh international conference on World Wide Web 7.
Finding related pages in the World Wide Web. WWW '99 Proceedings of the eighth international conference on World Wide Web.
Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM).
Squeal: a structured query language for the Web. Proceedings of the 9th international World Wide Web conference on Computer networks: the international journal of computer and telecommunications networking.
Web page change and persistence: a four-year longitudinal study. Journal of the American Society for Information Science and Technology.
The Evolution of the Web and Implications for an Incremental Crawler. VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases.
Compressing the Graph Structure of the Web. DCC '01 Proceedings of the Data Compression Conference.
The WebGraph Framework II: Codes For The World-Wide Web. DCC '04 Proceedings of the Conference on Data Compression.
What's new on the web?: the evolution of the web from a search engine perspective. Proceedings of the 13th international conference on World Wide Web.
The webgraph framework I: compression techniques. Proceedings of the 13th international conference on World Wide Web.
Local methods for estimating PageRank values. Proceedings of the thirteenth ACM international conference on Information and knowledge management.
Modelling information persistence on the web. ICWE '06 Proceedings of the 6th international conference on Web engineering.
The discoverability of the web. Proceedings of the 16th international conference on World Wide Web.
MapReduce: simplified data processing on large clusters. OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6.
Dryad: distributed data-parallel programs from sequential building blocks. Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007.
Dynamo: Amazon's highly available key-value store. Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles.
Complex queries over web repositories. VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29.
Link analysis for Web spam detection. ACM Transactions on the Web (TWEB).
Bigtable: A Distributed Storage System for Structured Data. ACM Transactions on Computer Systems (TOCS).
A large time-aware web graph. ACM SIGIR Forum.
Temporal Evolution of the UK Web. ICDMW '08 Proceedings of the 2008 IEEE International Conference on Data Mining Workshops.
The scalable hyperlink store. Proceedings of the 20th ACM conference on Hypertext and hypermedia.
PEGASUS: A Peta-Scale Graph Mining System Implementation and Observations. ICDM '09 Proceedings of the 2009 Ninth IEEE International Conference on Data Mining.
Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review.
Pregel: a system for large-scale graph processing. Proceedings of the 2010 ACM SIGMOD International Conference on Management of data.
Comet: an active distributed key-value store. OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation.

Abstract

In this paper, we study efficient ways to construct, represent, and analyze large-scale archival web graphs. We first discuss the details of a distributed graph construction algorithm implemented in MapReduce and the design of a space-efficient layered graph representation. In designing this representation, we consider both offline and online algorithms for graph analysis. Offline algorithms, such as PageRank, can run on MapReduce and similar large-scale distributed frameworks. Online algorithms, in contrast, can be implemented by tapping into a scalable repository (similar to DEC's Connectivity Server or Najork's Scalable Hyperlink Store) to perform their computations. We also consider updating the graph representation with the most recent information available and propose an efficient way to perform such updates using MapReduce. We survey various storage options and outline the essential API calls for a real-time access repository specific to archival web graphs. Finally, we conclude with a discussion of ideas for archival web graph analyses that may reveal novel patterns useful for designing state-of-the-art compression techniques.
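
The abstract does not spell out the construction algorithm or the layered representation, so the following Python sketch is only one plausible reading, not the authors' design. It simulates a MapReduce-style pass in memory: the map step keys each observed link by its source page, and the reduce step stores one layer per crawl snapshot as a delta (links added, links removed) against the previous snapshot. The input format, the delta encoding, and the `out_links` call are all assumptions introduced for illustration, and a page missing from a snapshot is treated as having no out-links at that time.

```python
from collections import defaultdict

# Hypothetical input: (timestamp, source_url, target_url) link observations,
# one per link seen in a crawl snapshot. Not the paper's actual data format.
OBSERVATIONS = [
    (1, "a", "b"), (1, "a", "c"), (1, "b", "c"),
    (2, "a", "b"), (2, "a", "d"), (2, "b", "c"),
    (3, "a", "d"), (3, "b", "c"),
]


def map_phase(observations):
    """Map: key each observed link by its source page."""
    for ts, src, dst in observations:
        yield src, (ts, dst)


def shuffle(pairs):
    """Group mapper output by key, as a MapReduce runtime would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(src, values, snapshots):
    """Reduce: one layer per snapshot, stored as a delta against the previous one."""
    by_ts = defaultdict(set)
    for ts, dst in values:
        by_ts[ts].add(dst)
    layers, previous = [], set()
    for ts in snapshots:
        current = by_ts.get(ts, set())
        added = sorted(current - previous)
        removed = sorted(previous - current)
        layers.append((ts, added, removed))
        previous = current
    return src, layers


def build_graph(observations):
    """Run map, shuffle, and reduce over all observations (assumed non-distributed)."""
    snapshots = sorted({ts for ts, _, _ in observations})
    groups = shuffle(map_phase(observations))
    return dict(reduce_phase(src, vals, snapshots) for src, vals in groups.items())


def out_links(graph, src, ts):
    """Hypothetical API call: out-links of src as of snapshot ts."""
    links = set()
    for layer_ts, added, removed in graph.get(src, []):
        if layer_ts > ts:
            break
        links |= set(added)
        links -= set(removed)
    return sorted(links)


graph = build_graph(OBSERVATIONS)
```

Storing deltas keeps pages whose links rarely change cheap to represent, at the cost of replaying layers at query time. A real repository would likely bound that cost with periodic full snapshots, which this sketch omits.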