A multi-level scalable startup for parallel applications

Authors:
Abhishek Gupta;Gengbin Zheng;Laxmikant V. Kalé
Affiliations:
University of Illinois at Urbana-Champaign, Urbana, IL;University of Illinois at Urbana-Champaign, Urbana, IL;University of Illinois at Urbana-Champaign, Urbana, IL
Venue:
Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers
Year:
2011



Abstract

High-performance parallel machines with hundreds of thousands of processors and petascale performance are already in use, and even larger exaflop-scale systems, which may have hundreds of millions of cores, are planned. One of the biggest challenges in running parallel applications on machines of such massive scale is the parallel startup process. This task involves two components: (1) launching the appropriate processes in parallel on the given set of processors and (2) setting up communication channels so that the processes can communicate with each other once launching has completed. Most current startup mechanisms either use special-purpose daemons, which waste system resources, or use a startup manager, which becomes a scalability bottleneck. In this paper, we investigate the design and scalability of an SMP-aware, multi-level startup scheme with batching of remote shell sessions, which provides a complete solution for starting up a parallel application and facilitates its management during execution. The scheme monitors process health and can be used to support recovery from failures and provide scalable interaction with the application. We demonstrate the performance and scalability of this scheme by applying it to the startup of Charm++ applications. In particular, starting up a Charm++ program on 16,384 cores of Ranger (at TACC) with Ethernet as the underlying communication layer takes only 25 seconds and attains a speedup of over 400% compared to MPICH2 startup (using Hydra as the process manager) and over 800% compared to Open MPI startup on Ranger.
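
The idea of a multi-level launch with batched remote shell sessions can be illustrated with a small planning sketch. This is not the paper's implementation: the function names (build_launch_plan, remote_sessions), the k-ary tree layout, and the fanout and batch_size parameters are all assumptions made for illustration. The sketch only computes which node would start which other nodes and in what batches; it opens no remote shells. It assumes one launcher per SMP node, so local processes on a node would be forked locally rather than started through separate remote sessions.

```python
from typing import Dict, List


def chunk(items: List[str], size: int) -> List[List[str]]:
    """Split items into consecutive groups of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_launch_plan(
    nodes: List[str], fanout: int, batch_size: int
) -> Dict[str, List[List[str]]]:
    """Map each launching node to the batches of nodes it would start.

    Hypothetical layout: nodes form an implicit k-ary tree (k = fanout) in
    list order, so the children of nodes[i] are
    nodes[i*fanout + 1 : i*fanout + fanout + 1].
    Each parent starts at most `batch_size` remote sessions at a time.
    """
    if fanout < 1 or batch_size < 1:
        raise ValueError("fanout and batch_size must be positive")
    plan: Dict[str, List[List[str]]] = {}
    for i, parent in enumerate(nodes):
        children = nodes[i * fanout + 1 : i * fanout + fanout + 1]
        if children:  # leaf nodes launch nothing remotely
            plan[parent] = chunk(children, batch_size)
    return plan


def remote_sessions(plan: Dict[str, List[List[str]]]) -> int:
    """Count the remote sessions in the plan (one per launched node)."""
    return sum(len(batch) for batches in plan.values() for batch in batches)


if __name__ == "__main__":
    hosts = [f"n{i}" for i in range(7)]
    plan = build_launch_plan(hosts, fanout=2, batch_size=1)
    for parent, batches in plan.items():
        print(parent, "->", batches)
    print("remote sessions:", remote_sessions(plan))
```

In this sketch the root node starts only `fanout` nodes itself, and every other node is started by an intermediate launcher, so no single process has to open a session to every node. The total number of remote sessions is one per non-root node, regardless of how many cores each node has.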