The connectivity server: fast access to linkage information on the Web
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Finding related pages in the World Wide Web
WWW '99 Proceedings of the eighth international conference on World Wide Web
Authoritative sources in a hyperlinked environment
Journal of the ACM (JACM)
Squeal: a structured query language for the Web
Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
Web page change and persistence---a four-year longitudinal study
Journal of the American Society for Information Science and Technology
The Evolution of the Web and Implications for an Incremental Crawler
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Compressing the Graph Structure of the Web
DCC '01 Proceedings of the Data Compression Conference
The WebGraph Framework II: Codes For The World-Wide Web
DCC '04 Proceedings of the Conference on Data Compression
What's new on the web?: the evolution of the web from a search engine perspective
Proceedings of the 13th international conference on World Wide Web
The webgraph framework I: compression techniques
Proceedings of the 13th international conference on World Wide Web
Local methods for estimating pagerank values
Proceedings of the thirteenth ACM international conference on Information and knowledge management
Modelling information persistence on the web
ICWE '06 Proceedings of the 6th international conference on Web engineering
The discoverability of the web
Proceedings of the 16th international conference on World Wide Web
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Dryad: distributed data-parallel programs from sequential building blocks
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Dynamo: amazon's highly available key-value store
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Complex queries over web repositories
VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Link analysis for Web spam detection
ACM Transactions on the Web (TWEB)
Bigtable: A Distributed Storage System for Structured Data
ACM Transactions on Computer Systems (TOCS)
ACM SIGIR Forum
Temporal Evolution of the UK Web
ICDMW '08 Proceedings of the 2008 IEEE International Conference on Data Mining Workshops
Proceedings of the 20th ACM conference on Hypertext and hypermedia
PEGASUS: A Peta-Scale Graph Mining System Implementation and Observations
ICDM '09 Proceedings of the 2009 Ninth IEEE International Conference on Data Mining
Cassandra: a decentralized structured storage system
ACM SIGOPS Operating Systems Review
Pregel: a system for large-scale graph processing
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Comet: an active distributed key-value store
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Hi-index | 0.00 |
In this paper, we study efficient ways to construct, represent and analyze large-scale archival web graphs. We first discuss details of the distributed graph construction algorithm implemented in MapReduce and the design of a space-efficient layered graph representation. While designing this representation, we consider both offline and online algorithms for the graph analysis. The offline algorithms, such as PageRank, can use MapReduce and similar large-scale, distributed frameworks for computation. On the other side, online algorithms can be implemented by tapping into a scalable repository (similar to DEC's Connectivity Server or Scalable Hyperlink Store by Najork), in order to perform the computations. Moreover, we also consider updating the graph representation with the most recent information available and propose an efficient way to perform updates using MapReduce. We survey various storage options and outline essential API calls for the archival web graph specific real-time access repository. Finally, we conclude with a discussion of ideas for interesting archival web graph analysis that can lead us to discover novel patterns for designing state-of-art compression techniques.