HFAA: a generic socket API for Hadoop file systems

Authors:
Adam Yee; Jeffrey Shafer
Affiliations:
University of the Pacific, Stockton, CA (both authors)
Venue:
Proceedings of the 2nd Workshop on Architectures and Systems for Big Data
Year:
2012

Abstract

Hadoop is an open-source implementation of the MapReduce programming model for distributed computing. Hadoop natively integrates with the Hadoop Distributed File System (HDFS), a user-level file system. In this paper, we introduce the Hadoop Filesystem Agnostic API (HFAA) to allow Hadoop to integrate with any distributed file system over TCP sockets. With this API, HDFS can be replaced by distributed file systems such as PVFS, Ceph, Lustre, or others, thereby allowing direct comparisons in terms of performance and scalability. Unlike previous attempts at augmenting Hadoop with new file systems, the socket API presented here eliminates the need to customize Hadoop's Java implementation, and instead moves the implementation responsibilities to the file system itself. Thus, developers wishing to integrate their new file system with Hadoop are not responsible for understanding details of Hadoop's internal operation. In this paper, an initial implementation of HFAA is used to replace HDFS with PVFS, a file system popular in high-performance computing environments. Compared with an alternate method of integrating with PVFS (a POSIX kernel interface), HFAA increases write and read throughput by 23% and 7%, respectively.
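
To illustrate the general idea of reaching a file system through a TCP socket rather than through file-system-specific code in the client, the following is a minimal sketch. It is not the HFAA protocol: the abstract does not describe the wire format, so the length-prefixed JSON framing, the `read`/`write` operation names, and every function and class name here (`send_msg`, `recv_msg`, `ToyFileSystemHandler`, `call`, `start_server`) are assumptions made for illustration. The in-memory dictionary merely stands in for a real file system such as PVFS.

```python
import json
import socket
import socketserver
import struct
import threading

# Hypothetical wire format (an assumption, not taken from the paper):
# a 4-byte big-endian length followed by a UTF-8 JSON body.


def send_msg(sock, obj):
    """Serialize obj as JSON and send it with a length prefix."""
    body = json.dumps(obj).encode("utf-8")
    sock.sendall(struct.pack(">I", len(body)) + body)


def recv_exact(sock, n):
    """Read exactly n bytes, or raise if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed connection")
        buf += chunk
    return buf


def recv_msg(sock):
    """Receive one length-prefixed JSON message."""
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return json.loads(recv_exact(sock, length).decode("utf-8"))


class ToyFileSystemHandler(socketserver.BaseRequestHandler):
    """File-system side: answers one request per connection."""

    files = {}  # in-memory stand-in for the real file system

    def handle(self):
        req = recv_msg(self.request)
        if req["op"] == "write":
            self.files[req["path"]] = req["data"]
            send_msg(self.request, {"ok": True})
        elif req["op"] == "read":
            if req["path"] in self.files:
                send_msg(self.request,
                         {"ok": True, "data": self.files[req["path"]]})
            else:
                send_msg(self.request, {"ok": False, "error": "not found"})
        else:
            send_msg(self.request, {"ok": False, "error": "unknown op"})


def call(address, request):
    """Client side: open a socket, send one request, return the reply."""
    with socket.create_connection(address) as sock:
        send_msg(sock, request)
        return recv_msg(sock)


def start_server():
    """Start the toy server on an ephemeral localhost port."""
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0),
                                             ToyFileSystemHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

In this sketch the client only needs to know the socket address and the message format, while all storage logic lives behind the server, which mirrors the division of responsibility the abstract describes. A caller would run `server = start_server()` and then, for example, `call(server.server_address, {"op": "write", "path": "/a.txt", "data": "hello"})`.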