Debugging Parallel Programs with Instant Replay
IEEE Transactions on Computers
LogGP: incorporating long messages into the LogP model for parallel computation
Journal of Parallel and Distributed Computing
MPI-SIM: using parallel simulation to evaluate MPI programs
Proceedings of the 30th conference on Winter simulation
Predictive analysis of a wavefront application using LogGP
Proceedings of the seventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Predictive performance and scalability modeling of a large-scale application
Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Automatically characterizing large scale program behavior
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
DiP: A Parallel Program Development Environment
Euro-Par '96 Proceedings of the Second International Euro-Par Conference on Parallel Processing-Volume II
A framework for performance modeling and prediction
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Array regrouping and structure splitting using whole-program reference affinity
Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation
Cross-architecture performance predictions for scientific applications using parameterized models
Proceedings of the joint international conference on Measurement and modeling of computer systems
Performance Prediction Using Simulation of Large-Scale Interconnection Networks in POSE
Proceedings of the 19th Workshop on Principles of Advanced and Distributed Simulation
International Journal of High Performance Computing Applications
Cross-Platform Performance Prediction of Parallel Applications Using Partial Execution
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
A Performance Model of the Krak Hydrodynamics Application
ICPP '06 Proceedings of the 2006 International Conference on Parallel Processing
Scaling an optimistic parallel simulation of large-scale interconnection networks
WSC '05 Proceedings of the 37th conference on Winter simulation
Methods of inference and learning for performance modeling of parallel applications
Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
A regression-based approach to scalability prediction
Proceedings of the 22nd annual international conference on Supercomputing
Performance prediction of large-scale parallell system and application using macro-level simulation
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
MPIWiz: subgroup reproducible replay of mpi applications
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
FACT: fast communication trace collection for parallel applications through program slicing
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
A performance model of non-deterministic particle transport on large-scale systems
ICCS'03 Proceedings of the 2003 international conference on Computational science: PartIII
A compiler-based communication analysis approach for multiprocessor systems
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Retrospect: deterministic replay of MPI applications for interactive distributed debugging
PVM/MPI'07 Proceedings of the 14th European conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
LReplay: a pending period based deterministic replay scheme
Proceedings of the 37th annual international symposium on Computer architecture
ScalaExtrap: trace-based communication extrapolation for spmd programs
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Parkour: parallel speedup estimates for serial programs
HotPar'11 Proceedings of the 3rd USENIX conference on Hot topic in parallelism
Kismet: parallel speedup estimates for serial programs
Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications
ScalaExtrap: Trace-based communication extrapolation for SPMD programs
ACM Transactions on Programming Languages and Systems (TOPLAS)
Extending the BT NAS parallel benchmark to exascale computing
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Elastic and scalable tracing and accurate replay of non-deterministic events
Proceedings of the 27th international ACM conference on International conference on supercomputing
ACIC: automatic cloud I/O configurator for HPC applications
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Using automated performance modeling to find scalability bugs in complex codes
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Exploiting GPU Hardware Saturation for Fast Compiler Optimization
Proceedings of Workshop on General Purpose Processing Using GPUs
Hi-index | 0.00 |
For designers of large-scale parallel computers, it is greatly desired that performance of parallel applications can be predicted at the design phase. However, this is difficult because the execution time of parallel applications is determined by several factors, including sequential computation time in each process, communication time and their convolution. Despite previous efforts, it remains an open problem to estimate sequential computation time in each process accurately and efficiently for large-scale parallel applications on non-existing target machines. This paper proposes a novel approach to predict the sequential computation time accurately and efficiently. We assume that there is at least one node of the target platform but the whole target system need not be available. We make two main technical contributions. First, we employ deterministic replay techniques to execute any process of a parallel application on a single node at real speed. As a result, we can simply measure the real sequential computation time on a target node for each process one by one. Second, we observe that computation behavior of processes in parallel applications can be clustered into a few groups while processes in each group have similar computation behavior. This observation helps us reduce measurement time significantly because we only need to execute representative parallel processes instead of all of them. We have implemented a performance prediction framework, called PHANTOM, which integrates the above computation-time acquisition approach with a trace-driven network simulator. We validate our approach on several platforms. For ASCI Sweep3D, the error of our approach is less than 5% on 1024 processor cores. Compared to a recent regression-based prediction approach, PHANTOM presents better prediction accuracy across different platforms.