Overcoming extreme-scale reproducibility challenges through a unified, targeted, and multilevel toolset

Authors:
Dong H. Ahn;Gregory L. Lee;Ganesh Gopalakrishnan;Zvonimir Rakamarić;Martin Schulz;Ignacio Laguna
Affiliations:
Lawrence Livermore National Laboratory, Livermore, CA;Lawrence Livermore National Laboratory, Livermore, CA;University of Utah;University of Utah;Lawrence Livermore National Laboratory, Livermore, CA;Lawrence Livermore National Laboratory, Livermore, CA
Venue:
SE-HPCCSE '13 Proceedings of the 1st International Workshop on Software Engineering for High Performance Computing in Computational Science and Engineering
Year:
2013

Abstract

Reproducibility, the ability to repeat program executions with the same numerical result or code behavior, is crucial for computational science and engineering applications. However, non-determinism in concurrency scheduling often hampers achieving this ability on high performance computing (HPC) systems. To aid in managing the adverse effects of non-determinism, prior work has provided techniques to achieve bit-precise reproducibility, but most of them focus only on small-scale parallelism. While scalable techniques recently emerged, they are disparate and target special purposes, e.g., single-schedule domains. On current systems with O(10^6) compute cores and future ones with O(10^9), any technique that does not embrace a unified, targeted, and multilevel approach will fall short of providing reproducibility. In this paper, we argue for a common toolset that embodies this approach, where programmers select and compose complementary tools and can effectively, yet scalably, analyze, control, and eliminate sources of non-determinism at scale. This allows users to gain reproducibility only to the levels demanded by specific code development needs. We present our research agenda and ongoing work toward this goal.
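
As an illustration of how scheduling non-determinism can change a numerical result, the following minimal sketch sums the same four floating-point values in two different arrival orders, as might happen when partial results reach a reduction in a schedule-dependent order. It is not taken from the paper or its toolset: the function name `reduce_in_order`, the sample values, and the use of Python's `math.fsum` as an order-independent alternative are all assumptions chosen for illustration.

```python
import math

def reduce_in_order(values, order):
    """Naively accumulate values in the given arrival order (hypothetical helper)."""
    total = 0.0
    for index in order:
        total += values[index]
    return total

# Sample contributions from four hypothetical workers; the exact sum is 2.0.
values = [1e16, 1.0, -1e16, 1.0]

# Two arrival orders standing in for two different runtime schedules.
schedule_a = [0, 1, 2, 3]  # the first 1.0 is absorbed by 1e16 and lost to rounding
schedule_b = [0, 2, 1, 3]  # the large terms cancel first, so both 1.0 values survive

sum_a = reduce_in_order(values, schedule_a)
sum_b = reduce_in_order(values, schedule_b)

# math.fsum tracks exact partial sums, so its result does not depend on order.
exact_a = math.fsum(values[i] for i in schedule_a)
exact_b = math.fsum(values[i] for i in schedule_b)

print(sum_a, sum_b)      # the naive sums differ between schedules
print(exact_a, exact_b)  # the fsum results agree
```

Because floating-point addition is not associative, the naive sums disagree across the two orders even though the inputs are identical, while the order-independent summation agrees. In this sketch the choice is between accepting schedule-dependent results and paying for a more expensive reduction, which is one small instance of selecting reproducibility only to the level a given development need demands.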