Reproducibility, the ability to repeat program executions with the same numerical result or code behavior, is crucial for computational science and engineering applications. However, non-determinism in concurrency scheduling often hampers reproducibility on high performance computing (HPC) systems. To help manage the adverse effects of non-determinism, prior work has provided techniques to achieve bit-precise reproducibility, but most of them focus only on small-scale parallelism. While scalable techniques have recently emerged, they are disparate and target special purposes, e.g., single-schedule domains. On current systems with O(10^6) compute cores and future ones with O(10^9), any technique that does not embrace a unified, targeted, and multilevel approach will fall short of providing reproducibility. In this paper, we argue for a common toolset that embodies this approach, in which programmers select and compose complementary tools to analyze, control, and eliminate sources of non-determinism effectively and scalably. This allows users to gain reproducibility only to the levels demanded by specific code development needs. We present our research agenda and ongoing work toward this goal.