Archiving the relaxed consistency web

Authors:
Zhiwu Xie; Herbert Van de Sompel; Jinyang Liu; Johann van Reenen; Ramiro Jordan
Affiliations:
Virginia Tech, Blacksburg, VA, USA; Los Alamos National Laboratory, Los Alamos, NM, USA; Howard Hughes Medical Institute, Ashburn, VA, USA; University of New Mexico, Albuquerque, NM, USA; University of New Mexico, Albuquerque, NM, USA
Venue:
Proceedings of the 22nd ACM international conference on information & knowledge management
Year:
2013


Abstract

The historical, cultural, and intellectual importance of archiving the web has been widely recognized. Today, all countries with a high Internet penetration rate have established high-profile archiving initiatives to crawl and archive fast-disappearing web content for long-term use. As web technologies evolve, established web archiving techniques face challenges. This paper focuses on the potential impact of relaxed consistency web design on crawler-driven web archiving. Relaxed consistency websites may disseminate, albeit ephemerally, inaccurate and even contradictory information. If captured and preserved in web archives as historical records, such information will degrade overall archival quality. To assess the extent of this degradation, we build a simplified feed-following application and simulate its operation with synthetic workloads. The results indicate that a non-trivial portion of a relaxed consistency web archive may contain observable inconsistency, and that the inconsistency window may extend significantly longer than that observed at the data store. We discuss the nature of this quality degradation and propose a few possible remedies.
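
The idea that a crawler can capture contradictory pages from a relaxed consistency site can be illustrated with a toy model. The sketch below is not the authors' simulation. It assumes a hypothetical setup in which a producer's page is served from an up-to-date primary store, a follower's feed page is served from a replica that trails the primary by a fixed lag, and a crawler fetches both pages at the same instant. All names (`visible_count`, `crawl_is_inconsistent`, `inconsistent_fraction`) and the fixed-lag model are assumptions made for illustration.

```python
from bisect import bisect_right


def visible_count(write_times, t):
    """Return how many writes have a timestamp <= t (write_times is sorted)."""
    return bisect_right(write_times, t)


def crawl_is_inconsistent(write_times, crawl_time, lag):
    """Check one crawl for an observable contradiction.

    Assumed model: the producer page reflects every write up to crawl_time,
    while the follower feed only reflects writes up to crawl_time - lag.
    The capture is inconsistent when the feed shows fewer events.
    """
    on_producer_page = visible_count(write_times, crawl_time)
    on_follower_feed = visible_count(write_times, crawl_time - lag)
    return on_follower_feed < on_producer_page


def inconsistent_fraction(write_times, crawl_times, lag):
    """Fraction of crawls whose two captured pages disagree."""
    flagged = [crawl_is_inconsistent(write_times, t, lag) for t in crawl_times]
    return sum(flagged) / len(flagged)


if __name__ == "__main__":
    # Synthetic workload (made-up numbers): three posts, one crawl per
    # time unit over 100 units, replica lag of 5 units.
    writes = [10, 50, 90]
    crawls = range(100)
    print(inconsistent_fraction(writes, crawls, lag=5))  # prints 0.15
```

In this toy model each write leaves a window of `lag` time units during which a crawl records a feed that contradicts the producer page, so the inconsistent share of the archive grows with both the write rate and the lag. It does not model caching, feed materialization, or crawl scheduling, any of which could change the picture.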