I/O threads to reduce checkpoint blocking for an electromagnetics solver on Blue Gene/P and Cray XK6

Authors:
Jing Fu;Robert Latham;Misun Min;Christopher D. Carothers
Affiliations:
Rensselaer Poly. Inst., Troy, NY;Argonne National Laboratory, Argonne, IL;Argonne National Laboratory, Argonne, IL;Rensselaer Poly. Inst., Troy, NY
Venue:
Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers
Year:
2012

Abstract

Application-level checkpointing is one of the most popular techniques for proactively dealing with unexpected failures on supercomputers with hundreds of thousands of cores. Unfortunately, this approach generates a heavy I/O load and often causes I/O bottlenecks in production runs. In this paper, we examine a new thread-based, application-level checkpointing approach for a massively parallel electromagnetic solver system on the IBM Blue Gene/P at Argonne National Laboratory and the Cray XK6 at Oak Ridge National Laboratory. We discuss an I/O-thread-based, application-level, two-phase I/O approach called "threaded reduced-blocking I/O" (threaded rbIO) and compare it with the regular version of "reduced-blocking I/O" (rbIO) and a tuned MPI-IO collective approach (coIO). Our study shows that threaded rbIO can overlap I/O latency with computation and achieve near-asynchronous checkpointing, with an application-perceived I/O performance of over 70 GB/s (raw I/O bandwidth of 15 GB/s) and 50 GB/s (raw I/O bandwidth of 17 GB/s) on up to 32K processors of Intrepid and Jaguar, respectively. Compared with rbIO and coIO, the threading approach greatly improves the production performance of NekCEM on the Blue Gene/P and Cray XK6 machines: reduced checkpoint blocking significantly shortens the total simulation time. We also discuss the potential strength of this approach with the Scalable Checkpoint Restart library and on other full-featured operating systems, such as the one to be deployed on the upcoming Blue Gene/Q.
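
The overlap of checkpoint I/O with computation described in the abstract can be illustrated with a minimal single-process sketch. This is not the authors' implementation: the paper targets MPI applications on Blue Gene/P and Cray XK6, whereas this sketch uses a Python thread and a queue only to show the idea of handing a checkpoint buffer to a background I/O thread so that the compute loop blocks for a copy rather than for the file write. The names `ThreadedCheckpointWriter`, `run_simulation`, and the `ckpt_*.bin` file layout are hypothetical and chosen for illustration.

```python
import os
import queue
import tempfile
import threading


class ThreadedCheckpointWriter:
    """Hypothetical writer that drains checkpoint buffers on an I/O thread."""

    def __init__(self, directory):
        self.directory = directory
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def _drain(self):
        # Runs on the I/O thread: write each queued buffer to its own file.
        while True:
            item = self._queue.get()
            if item is None:  # sentinel: no more checkpoints will arrive
                break
            step, payload = item
            path = os.path.join(self.directory, f"ckpt_{step:06d}.bin")
            with open(path, "wb") as f:
                f.write(payload)

    def checkpoint(self, step, data):
        # Snapshot the data so the solver may keep mutating its own buffer.
        # The caller blocks only for this copy and the enqueue, not the write.
        self._queue.put((step, bytes(data)))

    def close(self):
        # Signal the I/O thread to finish and wait for outstanding writes.
        self._queue.put(None)
        self._thread.join()


def run_simulation(directory, steps=4, size=1024):
    """Toy compute loop that checkpoints after every step (assumed cadence)."""
    writer = ThreadedCheckpointWriter(directory)
    field = bytearray(size)
    for step in range(steps):
        # Stand-in for one solver time step.
        for i in range(size):
            field[i] = (field[i] + step + 1) % 256
        writer.checkpoint(step, field)
    writer.close()
    return sorted(os.listdir(directory))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_simulation(tmp, steps=3, size=16))
```

Because the compute loop waits only for the in-memory copy, the bandwidth the application perceives (checkpoint size divided by time blocked) can exceed the raw bandwidth of the underlying file write, which is the distinction the abstract draws between application-perceived and raw I/O figures.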