Lessons learned at 208K: towards debugging millions of cores
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
A lightweight library for building scalable tools
PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume 2
A Scalable Parallel Debugging Library with Pluggable Communication Protocols
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Optimizing latency and throughput for spawning processes on massively multicore processors
Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers
LIBI: A framework for bootstrapping extreme scale software systems
Parallel Computing
Proceedings of the 27th ACM International Conference on Supercomputing
Optimizing process creation and execution on multi-core architectures
International Journal of High Performance Computing Applications
Distributed wait state tracking for runtime MPI deadlock detection
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Many tools that target parallel and distributed environments must co-locate a set of daemons with the distributed processes of the target application. However, efficient and portable deployment of these daemons on large-scale systems is an unsolved problem. We close this gap with LaunchMON, a scalable, robust, portable, secure, and general-purpose infrastructure for launching tool daemons. Its API allows tool builders to identify all processes of a target job, launch daemons on the relevant nodes, and control daemon interaction. Our results show that LaunchMON scales to very large daemon counts and substantially improves performance over existing ad hoc mechanisms.