GPFS: A Shared-Disk File System for Large Computing Clusters
FAST '02 Proceedings of the Conference on File and Storage Technologies
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Experiences with MapReduce, an abstraction for large-scale computation
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Bigtable: a distributed storage system for structured data
OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?
FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Failure trends in a large disk drive population
FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Dynamo: amazon's highly available key-value store
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Scalable performance of the Panasas parallel file system
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
I/O performance challenges at leadership scale
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
DiscFinder: a data-intensive scalable cluster finder for astrophysics
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Cloud analytics: do we really need to reinvent the storage stack?
HotCloud'09 Proceedings of the 2009 conference on Hot topics in cloud computing
Improving MapReduce performance in heterogeneous environments
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
The Hadoop Distributed File System
MSST '10 Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)
Reining in the outliers in map-reduce clusters using Mantri
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Spectral analysis for billion-scale graphs: discoveries and implementation
PAKDD'11 Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining - Volume Part II
SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
MixApart: decoupled analytics for shared storage systems
HotStorage'12 Proceedings of the 4th USENIX conference on Hot Topics in Storage and File Systems
HFAA: a generic socket API for Hadoop file systems
Proceedings of the 2nd Workshop on Architectures and Systems for Big Data
High performance RDMA-based design of HDFS over InfiniBand
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
IKAROS: An HTTP-Based Distributed File System, for Low Consumption & Low Specification Devices
Journal of Grid Computing
MixApart: decoupled analytics for shared storage systems
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
Hi-index | 0.00 |
Data-intensive applications fall into two computing styles: Internet services (cloud computing) or high-performance computing (HPC). In both categories, the underlying file system is a key component for scalable application performance. In this paper, we explore the similarities and differences between PVFS, a parallel file system used in HPC at large scale, and HDFS, the primary storage system used in cloud computing with Hadoop. We integrate PVFS into Hadoop and compare its performance to HDFS using a set of data-intensive computing benchmarks. We study how HDFS-specific optimizations can be matched using PVFS and how consistency, durability, and persistence tradeoffs made by these file systems affect application performance. We show how to embed multiple replicas into a PVFS file, including a mapping with a complete copy local to the writing client, to emulate HDFS's file layout policies. We also highlight implementation issues with HDFS's dependence on disk bandwidth and benefits from pipelined replication.