MRAP: a novel MapReduce-based framework to support HPC analytics applications with access patterns

Authors:
Saba Sehrish;Grant Mackey;Jun Wang;John Bent
Affiliations:
University of Central Florida;University of Central Florida;University of Central Florida;Los Alamos National Laboratory
Venue:
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Year:
2010

Abstract

Due to the explosive growth in the size of scientific data sets, data-intensive computing is an emerging trend in computational science. Many application scientists are looking to integrate data-intensive computing into compute-intensive High Performance Computing (HPC) facilities, particularly for data analytics. We have observed several scientific applications that must migrate their data from an HPC storage system to a data-intensive one. Because there is a gap between the data semantics of HPC storage and those of data-intensive systems, the data must be further refined and reorganized once migrated. This reorganization requires at least two complete scans through the data set and then at least one MapReduce program to prepare the data before analyzing it. Running multiple MapReduce phases causes significant overhead for the application in the form of excessive I/O operations: every MapReduce program that must be run to complete the desired data analysis performs a distributed read and write operation on the file system. Our contribution is to extend MapReduce to eliminate the multiple scans and to reduce the number of pre-processing MapReduce programs. We have added expressiveness to the MapReduce language that allows users to specify the logical semantics of their data such that 1) the data can be analyzed without running multiple data pre-processing MapReduce programs, and 2) the data can be reorganized as it is migrated to the data-intensive file system. Using our augmented MapReduce system, MapReduce with Access Patterns (MRAP), we have demonstrated up to 33% throughput improvement in one real application and up to 70% in an I/O kernel of another application.
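
As a rough illustration of what specifying the logical semantics of data could look like, the following Python sketch describes a strided (non-contiguous) access pattern and gathers the matching regions in one pass. It is not MRAP's interface: the names `StridedPattern` and `read_pattern` and the four pattern fields are assumptions made for this example, and an in-memory byte string stands in for a file on a distributed file system.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass(frozen=True)
class StridedPattern:
    """Hypothetical access-pattern description (illustrative, not MRAP's API)."""
    start: int   # byte offset of the first region
    length: int  # bytes in each contiguous region
    stride: int  # bytes between the starts of successive regions
    count: int   # number of regions to read

    def ranges(self) -> Iterator[Tuple[int, int]]:
        """Yield the (begin, end) byte range of each region, end exclusive."""
        for i in range(self.count):
            begin = self.start + i * self.stride
            yield begin, begin + self.length


def read_pattern(data: bytes, pattern: StridedPattern) -> bytes:
    """Gather every region named by the pattern in a single pass over the data."""
    return b"".join(data[begin:end] for begin, end in pattern.ranges())


if __name__ == "__main__":
    data = bytes(range(16))  # stand-in for a file's contents
    pattern = StridedPattern(start=2, length=2, stride=8, count=2)
    print(list(read_pattern(data, pattern)))
```

In this sketch the pattern carries enough information for a reader to fetch only the bytes an analysis needs, which is the general idea behind avoiding a separate pre-processing program that first rewrites the data into contiguous form.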