Monitoring distributed programs on high-performance supercomputers is challenging, yet it is essential both for administering the machines properly and for helping users understand what their programs are doing during production runs. To this end, we created a flexible monitoring capability for a major class of scientific applications, programs that use MPI. It efficiently gathers information from the distributed program and collects it at a central point. This data can then be used to understand both application-centric and system-centric issues, and to improve, administer, and maintain both the complex applications that produce important scientific results and the complex systems that execute them.