GPFS: A Shared-Disk File System for Large Computing Clusters
FAST '02 Proceedings of the Conference on File and Storage Technologies
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
A Benchmark Suite for SOAP-based Communication in Grid Web Services
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
A Table-Driven Streaming XML Parsing Methodology for High-Performance Web Services
ICWS '06 Proceedings of the IEEE International Conference on Web Services
MapReduce: simplified data processing on large clusters
Communications of the ACM - 50th anniversary issue: 1958 - 2008
CLOUDLET: towards mapreduce implementation on virtual machines
Proceedings of the 18th ACM international symposium on High performance distributed computing
The Hadoop Distributed File System
MSST '10 Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)
CLOUDCOM '10 Proceedings of the 2010 IEEE Second International Conference on Cloud Computing Technology and Science
Magellan: experiences from a science cloud
Proceedings of the 2nd international workshop on Scientific cloud computing
DELMA: Dynamically ELastic MapReduce Framework for CPU-Intensive Applications
CCGRID '11 Proceedings of the 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing
Evaluating the suitability of mapreduce for surface temperature analysis codes
Proceedings of the second international workshop on Data intensive computing in the clouds
Riding the elephant: managing ensembles with hadoop
Proceedings of the 2011 ACM international workshop on Many task computing on grids and supercomputers
MARLA: MapReduce for Heterogeneous Clusters
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
MapReduce framework energy adaptation via temperature awareness
Cluster Computing
Hi-index | 0.00 |
MapReduce is increasingly becoming a popular framework, and a potent programming model. The most popular open source implementation of MapReduce, Hadoop, is based on the Hadoop Distributed File System (HDFS). However, as HDFS is not POSIX compliant, it cannot be fully leveraged by applications running on a majority of existing HPC environments such as Teragrid and NERSC. These HPC environments typically support globally shared file systems such as NFS and GPFS. On such resourceful HPC infrastructures, the use of Hadoop not only creates compatibility issues, but also affects overall performance due to the added overhead of the HDFS. This paper not only presents a MapReduce implementation directly suitable for HPC environments, but also exposes the design choices for better performance gains in those settings. By leveraging inherent distributed file systems' functions, and abstracting them away from its MapReduce framework, MARIANE (MApReduce Implementation Adapted for HPC Environments) not only allows for the use of the model in an expanding number of HPC environments, but also allows for better performance in such settings. This paper shows the applicability and high performance of the MapReduce paradigm through MARIANE, an implementation designed for clustered and shared-disk file systems and as such not dedicated to a specific MapReduce solution. The paper identifies the components and trade-offs necessary for this model, and quantifies the performance gains exhibited by our approach in distributed environments over Apache Hadoop in a data intensive setting, on the Magellan test bed at the National Energy Research Scientific Computing Center (NERSC).