High throughput data-compression for cloud storage

Authors:
Bogdan Nicolae
Affiliations:
University of Rennes 1, IRISA, Rennes, France
Venue:
Globe'10 Proceedings of the Third international conference on Data management in grid and peer-to-peer systems
Year:
2010

Abstract

As the data volumes processed by large-scale distributed data-intensive applications grow rapidly, increasing I/O pressure is put on the underlying storage service, which is responsible for data management. One particularly difficult challenge for the storage service is to sustain a high I/O throughput in spite of heavy access concurrency to massive data. Doing so requires massively parallel data transfers, which invariably lead to high bandwidth utilization. With the emergence of cloud computing, data-intensive applications become attractive to a wide public that does not have the resources to maintain the expensive large-scale distributed infrastructures needed to run them. In this context, minimizing storage space and bandwidth utilization is highly relevant, as these resources are paid for according to consumption. This paper evaluates the trade-off that results from transparently applying data compression to conserve storage space and bandwidth at the cost of a slight computational overhead. We aim to reduce storage space and bandwidth needs with minimal impact on I/O throughput under heavy access concurrency. Our solution builds on BlobSeer, a highly parallel distributed data management service specifically designed to enable reading, writing and appending huge data sequences that are fragmented and distributed at a large scale. We demonstrate the benefits of our approach through extensive experiments on the Grid'5000 testbed.
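
The trade-off named in the abstract (less storage space and bandwidth in exchange for some CPU time) can be illustrated with a minimal sketch. It is not the paper's implementation and does not use the BlobSeer API. The choice of zlib as the codec, the chunk size, and all function names below are assumptions made for illustration only. Each chunk is compressed independently, and a chunk is stored raw when compression would not shrink it, so incompressible data costs no extra space.

```python
import zlib
from typing import List, Tuple

# Hypothetical chunk size chosen for this sketch; not taken from the paper.
CHUNK_SIZE = 64 * 1024

# A stored chunk is a (is_compressed, payload) pair.
StoredChunk = Tuple[bool, bytes]


def compress_chunk(chunk: bytes, level: int = 1) -> StoredChunk:
    """Compress one chunk; fall back to the raw bytes if that is not smaller."""
    packed = zlib.compress(chunk, level)
    if len(packed) < len(chunk):
        return True, packed
    return False, chunk


def decompress_chunk(is_compressed: bool, payload: bytes) -> bytes:
    """Invert compress_chunk for a single stored chunk."""
    return zlib.decompress(payload) if is_compressed else payload


def write_blob(data: bytes) -> List[StoredChunk]:
    """Split data into fixed-size chunks and compress each one independently."""
    return [
        compress_chunk(data[i:i + CHUNK_SIZE])
        for i in range(0, len(data), CHUNK_SIZE)
    ]


def read_blob(chunks: List[StoredChunk]) -> bytes:
    """Reassemble the original data from the stored chunks."""
    return b"".join(decompress_chunk(flag, payload) for flag, payload in chunks)


def stored_size(chunks: List[StoredChunk]) -> int:
    """Number of payload bytes that would be stored or transferred."""
    return sum(len(payload) for _, payload in chunks)
```

Because the caller only sees `write_blob` and `read_blob`, compression stays transparent to the application. Comparing `stored_size(write_blob(data))` with `len(data)` gives the space and bandwidth saving for a given input, while timing the two calls gives the computational overhead side of the trade-off.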