Evaluating the usefulness of content addressable storage for high-performance data intensive applications

Authors:
Partho Nath; Bhuvan Urgaonkar; Anand Sivasubramaniam
Affiliations:
Cisco Systems, Inc., San Jose, CA, USA; Pennsylvania State University, University Park, PA, USA; Pennsylvania State University, University Park, PA, USA
Venue:
HPDC '08 Proceedings of the 17th international symposium on High performance distributed computing
Year:
2008

Abstract

Content Addressable Storage (CAS) is a data representation technique that partitions a given data-set into non-intersecting units called chunks and then employs techniques to efficiently recognize chunks that occur multiple times. This allows CAS to eliminate duplicate instances of such chunks, reducing storage space compared to conventional representations of data. CAS is an attractive technique for reducing the storage and network bandwidth needs of performance-sensitive, data-intensive applications in a variety of domains, including enterprise applications, Web-based e-commerce or entertainment services, and highly parallel scientific/engineering applications and simulations. In this paper, we conduct an empirical evaluation of the benefits offered by CAS to a variety of real-world data-intensive applications. The savings offered by CAS depend crucially on (i) the nature of the data-set itself and (ii) the chunk-size that CAS employs. We investigate the impact of both factors on disk space savings, network bandwidth savings, and the error resilience of data. We find that a chunk-size of 1 KB can provide up to 84% savings in disk space and even higher savings in network bandwidth, at the cost of reduced error resilience and 14% CAS-related overheads. Drawing upon lessons learned from our study, we provide insights on (i) the choice of chunk-size for effective space savings and (ii) the use of selective data replication to counter the loss of error resilience caused by CAS.
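
The chunk-and-deduplicate idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how chunks are delimited or how duplicates are recognized, so the fixed-size chunking, the use of SHA-256 digests as chunk identifiers, and the names `cas_store`, `rebuild`, and `space_savings` are all assumptions made for illustration. The sketch also ignores the bookkeeping cost of the digest list, so it says nothing about the overheads reported in the paper.

```python
import hashlib


def cas_store(data: bytes, chunk_size: int = 1024):
    """Split data into fixed-size chunks and keep one copy of each distinct chunk.

    Assumption: fixed-size chunks identified by their SHA-256 digest.
    """
    store = {}   # digest -> chunk bytes (each distinct chunk is stored once)
    recipe = []  # ordered digests needed to rebuild the original data
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # a repeated chunk adds no new bytes
        recipe.append(digest)
    return store, recipe


def rebuild(store, recipe) -> bytes:
    """Reassemble the original data by following the recipe of digests."""
    return b"".join(store[digest] for digest in recipe)


def space_savings(data: bytes, store) -> float:
    """Fraction of raw bytes saved by storing only the distinct chunks."""
    if not data:
        return 0.0
    stored = sum(len(chunk) for chunk in store.values())
    return 1.0 - stored / len(data)
```

For a 4 KB input made of three identical 1 KB chunks followed by one different chunk, this sketch stores two chunks and reports savings of 50%. It also shows why deduplication reduces error resilience: the shared chunk exists only once, so corrupting it damages every position in the recipe that refers to it.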