EverLast: a distributed architecture for preserving the web

Authors:
Avishek Anand;Srikanta Bedathur;Klaus Berberich;Ralf Schenkel;Christos Tryfonopoulos
Affiliations:
Max-Planck Institute for Informatics, Saarbrücken, Germany;Max-Planck Institute for Informatics, Saarbrücken, Germany;Max-Planck Institute for Informatics, Saarbrücken, Germany;Saarland University, Saarbrücken, Germany;Max-Planck Institute for Informatics, Saarbrücken, Germany
Venue:
Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
Year:
2009

Abstract

The World Wide Web has become a key source of knowledge pertaining to almost every walk of life. Unfortunately, much of the data on the Web is highly ephemeral in nature, with an estimated 50-80% of content changing within a short time. Continuing the pioneering efforts of many national (digital) libraries, organizations such as the International Internet Preservation Consortium (IIPC), the Internet Archive (IA) and the European Archive (EA) have been working tirelessly towards preserving the ever-changing Web. However, while these web archiving efforts have paid significant attention to long-term preservation of Web data, they have paid little attention to developing a global-scale infrastructure for collecting, archiving, and performing historical analyses on the collected data. Based on insights from our recent work on building text analytics for Web archives, we propose EverLast, a scalable distributed framework for next-generation Web archival and temporal text analytics over the archive. Our system is built on a loosely coupled distributed architecture that can be deployed over large-scale peer-to-peer networks. This allows the integration of the many archival efforts undertaken, mainly at a national level, by national digital libraries. Key features of EverLast include support for time-based text search and analysis and the use of human-assisted archive gathering. In this paper, we outline the overall architecture of EverLast and present some promising preliminary results.