On the benefits of transparent compression for cost-effective cloud data storage

Authors:
Bogdan Nicolae
Affiliations:
INRIA Saclay, Île-de-France
Venue:
Transactions on large-scale data- and knowledge-centered systems III
Year:
2011

Abstract

Infrastructure-as-a-Service (IaaS) cloud computing has revolutionized the way we think of acquiring computational resources: it allows users to deploy virtual machines (VMs) at large scale and pay only for the resources actually used throughout the runtime of the VMs. This model raises new challenges in the design and development of IaaS middleware: excessive storage costs associated with both user data and VM images might make the cloud less attractive, especially for users who need to manipulate huge data sets and a large number of VM images. Storage costs result not only from storage space utilization but also from bandwidth consumption: in typical deployments, a large number of data transfers are performed between the VMs and the persistent storage, all under high-performance requirements. This paper evaluates the trade-off that results from transparently applying data compression to conserve storage space and bandwidth at the cost of a slight computational overhead. We aim to reduce storage space and bandwidth needs with minimal impact on data access performance. Our solution builds on BlobSeer, a distributed data management service specifically designed to sustain high throughput for concurrent accesses to huge data sequences that are distributed at large scale. Extensive experiments demonstrate that our approach achieves large reductions (at least 40%) in bandwidth and storage space utilization, while still attaining high performance levels that even surpass the original (no compression) performance levels in several data-intensive scenarios.
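
The trade-off described in the abstract can be illustrated with a minimal sketch. It is not BlobSeer's API and not the paper's implementation: the function names, the one-byte chunk header, and the choice of zlib at a fast compression level are all assumptions made for illustration. Each chunk is compressed before it would be sent to storage, and kept raw when compression does not shrink it, so the caller reads and writes plain bytes either way.

```python
import os
import zlib

# Hypothetical one-byte header marking how a chunk was stored.
RAW, DEFLATE = b"\x00", b"\x01"


def pack_chunk(data: bytes, level: int = 1) -> bytes:
    """Compress a chunk; fall back to raw storage if it does not shrink."""
    compressed = zlib.compress(data, level)  # low level: cheap on CPU
    if len(compressed) < len(data):
        return DEFLATE + compressed
    return RAW + data


def unpack_chunk(stored: bytes) -> bytes:
    """Reverse pack_chunk so the caller always sees the original bytes."""
    tag, payload = stored[:1], stored[1:]
    return zlib.decompress(payload) if tag == DEFLATE else payload


def savings(data: bytes) -> float:
    """Fraction of bytes saved, on the wire and on disk, for one chunk."""
    return 1.0 - len(pack_chunk(data)) / len(data)


# Illustrative inputs only: repetitive log-like text and incompressible bytes.
text_like = b"timestamp=2011-01-01 level=INFO msg=request served\n" * 2000
random_like = os.urandom(64 * 1024)

if __name__ == "__main__":
    print(f"text-like chunk savings:   {savings(text_like):.1%}")
    print(f"random-like chunk savings: {savings(random_like):.1%}")
```

The fallback to raw storage bounds the worst case at one extra byte per chunk, which is one simple way to keep the space and bandwidth cost of incompressible data negligible; the CPU time spent attempting compression is still paid.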