Performance analysis of applications on modern high-end petascale systems is increasingly challenging because of the growing number and complexity of the computing units. This paper presents a performance-analysis study using the Vampir performance-analysis tool suite, which examines application behavior as well as fundamental system properties. The study was carried out on the Jaguar system at Oak Ridge National Laboratory, the fastest computer on the November 2009 Top500 list. We analyzed the FLASH simulation code, which is designed to scale to tens of thousands of CPU cores; at that scale, applying existing performance-analysis tools is itself very complex. The study reveals two classes of performance problems that are relevant at very high CPU counts: MPI communication and scalable I/O. For both, solutions are presented and verified. Finally, the paper proposes improvements and extensions to event-tracing tools so that the tools can scale to higher degrees of parallelism.