Cloud computing offers a powerful abstraction: a scalable, virtualized infrastructure delivered as a service, with the complexity of fine-grained resource management hidden from the end user. Running data analytics applications in the cloud on extremely large data sets is gaining traction because the underlying infrastructure can meet their extreme scalability demands. Typically, these applications (e.g., business intelligence, surveillance video searches) use the MapReduce framework, which decomposes a large computation into a set of smaller, parallelizable computations. More often than not, the storage architecture underlying a MapReduce application is an Internet-scale file system, such as GFS, which does not provide a standard (POSIX) interface. In this paper we revisit the debate on the need for a new, non-POSIX storage stack for cloud analytics and argue, based on an initial evaluation, that such a stack can instead be built on traditional POSIX-based cluster file systems. In the course of the evaluation, we compare the performance of a traditional cluster file system and a specialized Internet file system across a variety of workloads, covering both traditional and MapReduce-based applications. We present modifications to the cluster file system's allocation and layout information that better support the data-locality requirements of analytics applications. We introduce the concept of a metablock, which allows the larger block granularity preferred by MapReduce applications to coexist with the smaller block granularity required by traditional applications. We show that a cluster file system enhanced with metablocks can not only match the performance of specialized Internet file systems for MapReduce applications but also outperform them for traditional applications.
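The coexistence of two block granularities can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes that a metablock is a run of consecutive small blocks placed together on one node, and that metablocks are spread round-robin across nodes. The class name `Layout`, its fields, the round-robin placement, and the sizes used are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Layout:
    """Hypothetical two-level layout: small blocks grouped into metablocks."""
    block_size: int            # small block granularity, in bytes (assumed)
    blocks_per_metablock: int  # consecutive blocks per metablock (assumed)
    num_nodes: int             # nodes available for placement (assumed)

    @property
    def metablock_size(self) -> int:
        # The larger granularity that a MapReduce task would see.
        return self.block_size * self.blocks_per_metablock

    def locate(self, offset: int) -> Tuple[int, int, int]:
        """Map a byte offset to (metablock index, block index, node)."""
        if offset < 0:
            raise ValueError("offset must be non-negative")
        block = offset // self.block_size
        metablock = block // self.blocks_per_metablock
        # Assumption: round-robin placement of whole metablocks, so every
        # block of one metablock lands on the same node.
        node = metablock % self.num_nodes
        return metablock, block, node


# Illustrative values only: 1 MiB blocks, 64 blocks per metablock, 4 nodes.
layout = Layout(block_size=1 << 20, blocks_per_metablock=64, num_nodes=4)
first = layout.locate(0)
last_in_first_metablock = layout.locate((64 << 20) - 1)
start_of_second_metablock = layout.locate(64 << 20)
```

Under these assumptions, a traditional application still reads and writes at the small block size, while an analytics scheduler can treat each metablock as one locality unit because all of its blocks share a node.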