On the duality of data-intensive file system design: reconciling HDFS and PVFS

Authors:
Wittawat Tantisiriroj (Carnegie Mellon University); Seung Woo Son (Argonne National Laboratory); Swapnil Patil (Carnegie Mellon University); Samuel J. Lang (Argonne National Laboratory); Garth Gibson (Carnegie Mellon University); Robert B. Ross (Argonne National Laboratory)
Venue:
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2011

Abstract

Data-intensive applications fall into two computing styles: Internet services (cloud computing) and high-performance computing (HPC). In both categories, the underlying file system is a key component of scalable application performance. In this paper, we explore the similarities and differences between PVFS, a parallel file system used in HPC at large scale, and HDFS, the primary storage system used in cloud computing with Hadoop. We integrate PVFS into Hadoop and compare its performance to HDFS using a set of data-intensive computing benchmarks. We study how HDFS-specific optimizations can be matched using PVFS and how the consistency, durability, and persistence tradeoffs made by these file systems affect application performance. We show how to embed multiple replicas into a PVFS file, including a mapping with a complete copy local to the writing client, to emulate HDFS's file layout policies. We also highlight implementation issues with HDFS's dependence on disk bandwidth and benefits from pipelined replication.
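
The replica mapping mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the paper's actual layout, so everything below is an assumption made for illustration: the function name `replica_servers`, the ring-order choice of remote servers, and the default of three replicas are all hypothetical. The only property taken from the abstract is that one complete copy stays on the server local to the writing client.

```python
def replica_servers(chunk_index, num_servers, writer, replicas=3):
    """Return the server indices that hold each replica of one chunk.

    Hypothetical layout, not the paper's: replica 0 always lands on the
    writer's own server (the "complete local copy"); the remaining
    replicas rotate over the other servers so load spreads out.
    """
    if not 0 <= writer < num_servers:
        raise ValueError("writer must be a valid server index")
    if not 1 <= replicas <= num_servers:
        raise ValueError("need between 1 and num_servers replicas")
    if replicas == 1:
        return [writer]

    # All servers except the writer, in ring order starting just after it.
    others = [(writer + k) % num_servers for k in range(1, num_servers)]

    # Shift the starting point per chunk so remote copies rotate.
    start = (chunk_index * (replicas - 1)) % len(others)

    placement = [writer]
    for j in range(replicas - 1):
        placement.append(others[(start + j) % len(others)])
    return placement


if __name__ == "__main__":
    # Four servers, client co-located with server 1, three replicas.
    for chunk in range(3):
        print(chunk, replica_servers(chunk, num_servers=4, writer=1))
```

Under these assumptions every chunk has its first replica on the writer's server, and the remaining replicas of a chunk always fall on distinct remote servers, which is the general shape of a layout with one full local copy.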