BlobSeer: Next-generation data management for large scale infrastructures

Authors:
Bogdan Nicolae; Gabriel Antoniu; Luc Bougé; Diana Moise; Alexandra Carpen-Amarie
Affiliations:
University of Rennes 1, IRISA, Campus de Beaulieu, 35042 Rennes cedex, France; INRIA Rennes-Bretagne Atlantique, IRISA, Campus de Beaulieu, 35042 Rennes cedex, France; ENS Cachan - Brittany, IRISA, Campus de Beaulieu, 35042 Rennes cedex, France; INRIA Rennes-Bretagne Atlantique, IRISA, Campus de Beaulieu, 35042 Rennes cedex, France; INRIA Rennes-Bretagne Atlantique, IRISA, Campus de Beaulieu, 35042 Rennes cedex, France
Venue:
Journal of Parallel and Distributed Computing
Year:
2011

Abstract

As data volumes grow rapidly across a widening range of application fields in science, engineering, and information services, the challenges posed by data-intensive computing are gaining importance. The emergence of highly scalable infrastructures, e.g. for cloud computing and for petascale computing and beyond, introduces additional issues that make scalable data management an immediate need. This paper makes several contributions. First, it proposes a set of principles for designing highly scalable distributed storage systems that are optimized for heavy data access concurrency. In particular, we highlight the potentially large benefits of using versioning in this context. Second, based on these principles, we propose a set of versioning algorithms, for both data and metadata, that enable high throughput under concurrency. Finally, we implement and evaluate these algorithms in the BlobSeer prototype, which we integrate as a storage backend in the Hadoop MapReduce framework. We perform extensive microbenchmarks as well as experiments with real MapReduce applications, which demonstrate that applying the principles defended in our approach brings substantial benefits to data-intensive applications.