Cloud analytics: do we really need to reinvent the storage stack?

Authors:
Rajagopal Ananthanarayanan;Karan Gupta;Prashant Pandey;Himabindu Pucha;Prasenjit Sarkar;Mansi Shah;Renu Tewari
Affiliations:
IBM Research;IBM Research;IBM Research;IBM Research;IBM Research;IBM Research;IBM Research
Venue:
HotCloud'09 Proceedings of the 2009 conference on Hot topics in cloud computing
Year:
2009

Citing 3
Cited 9

GPFS: A Shared-Disk File System for Large Computing Clusters

FAST '02 Proceedings of the Conference on File and Storage Technologies
The Google file system

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6

Scalable repositories for virtual clusters

Euro-Par'09 Proceedings of the 2009 international conference on Parallel processing
An automated approach to cloud storage service selection

Proceedings of the 2nd international workshop on Scientific cloud computing
On the duality of data-intensive file system design: reconciling HDFS and PVFS

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
CAM: a topology aware minimum cost flow based resource manager for MapReduce applications in the cloud

Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
MixApart: decoupled analytics for shared storage systems

HotStorage'12 Proceedings of the 4th USENIX conference on Hot Topics in Storage and File Systems
HFAA: a generic socket API for Hadoop file systems

Proceedings of the 2nd Workshop on Architectures and Systems for Big Data
Zone-based data striping for cloud storage

IBM Journal of Research and Development
Structuring PLFS for extensibility

PDSW '13 Proceedings of the 8th Parallel Data Storage Workshop
MixApart: decoupled analytics for shared storage systems

FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies

Abstract

Cloud computing offers a powerful abstraction that provides a scalable, virtualized infrastructure as a service, where the complexity of fine-grained resource management is hidden from the end user. Running data analytics applications in the cloud on extremely large data sets is gaining traction because the underlying infrastructure can meet the extreme demands of scalability. Typically, these applications (e.g., business intelligence, surveillance video searches) leverage the MapReduce framework, which can decompose a large computation into a set of smaller parallelizable computations. More often than not, the underlying storage architecture for running a MapReduce application is based on an Internet-scale file system, such as GFS, which does not provide a standard (POSIX) interface. In this paper we revisit the debate on the need for a new non-POSIX storage stack for cloud analytics and argue, based on an initial evaluation, that the storage stack can be built on traditional POSIX-based cluster file systems. In the course of the evaluation, we compare the performance of a traditional cluster file system and a specialized Internet file system across a variety of workloads, for both traditional and MapReduce-based applications. We present modifications to the cluster file system's allocation and layout information to better support the data locality requirements of analytics applications. We introduce the concept of a metablock, which allows the larger block granularity chosen for MapReduce applications to coexist with the smaller block granularity required for traditional applications. We show that a cluster file system enhanced with metablocks can not only match the performance of specialized Internet file systems for MapReduce applications but also outperform them for traditional applications.
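
The metablock idea in the abstract can be illustrated with a minimal sketch. It is one plausible reading, not the authors' implementation: it assumes a metablock is a run of consecutive small blocks placed together on one node, so traditional I/O still addresses small blocks while a MapReduce scheduler sees one large, node-local unit. The names `Layout`, `locate`, and `splits`, the block sizes, and the round-robin placement are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical layout: small blocks grouped into larger metablocks."""
    block_size: int            # small block granularity, in bytes (assumed value)
    blocks_per_metablock: int  # consecutive blocks grouped into one metablock
    num_nodes: int             # storage nodes in the cluster

    @property
    def metablock_size(self) -> int:
        return self.block_size * self.blocks_per_metablock

    def locate(self, offset: int) -> tuple[int, int, int]:
        """Map a byte offset to (metablock index, block index, node)."""
        block = offset // self.block_size
        metablock = block // self.blocks_per_metablock
        # Assumption: metablocks are spread round-robin across nodes, so
        # every block of a given metablock lands on the same node.
        node = metablock % self.num_nodes
        return metablock, block, node

    def splits(self, file_size: int) -> list[tuple[int, int, int]]:
        """List (start offset, length, node) per metablock, i.e. the unit
        a locality-aware map task would be scheduled against."""
        result = []
        start = 0
        while start < file_size:
            length = min(self.metablock_size, file_size - start)
            _, _, node = self.locate(start)
            result.append((start, length, node))
            start += length
        return result


MiB = 1024 * 1024
layout = Layout(block_size=512 * 1024, blocks_per_metablock=128, num_nodes=4)
for start, length, node in layout.splits(150 * MiB):
    print(f"offset={start // MiB:>4} MiB  length={length // MiB:>3} MiB  node={node}")
```

In this sketch a traditional application still reads and writes at the 512 KiB block size, while a scheduler that calls `splits` sees 64 MiB node-local ranges. Both views are derived from the same layout, which is the coexistence the abstract describes.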