HPC environment management: new challenges in the petaflop era

Authors:
Jonas Dias; Albino Aveleda
Affiliations:
Federal University of Rio de Janeiro, Centro de Tecnologia, Rio de Janeiro, Brazil; Federal University of Rio de Janeiro, Centro de Tecnologia, Rio de Janeiro, Brazil
Venue:
VECPAR'10 Proceedings of the 9th international conference on High performance computing for computational science
Year:
2010

Citing 10
Cited 0

Groupware: some issues and experiences
Communications of the ACM

MPICH2: A New Start for MPI Implementations
Proceedings of the 9th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface

Sun Grid Engine: Towards Creating a Compute Power Grid
CCGRID '01 Proceedings of the 1st International Symposium on Cluster Computing and the Grid

NPACI Rocks: Tools and Techniques for Easily Deploying Manageable Linux Clusters
CLUSTER '01 Proceedings of the 3rd IEEE International Conference on Cluster Computing

Building Rich Web Applications with Ajax
Computer

TORQUE resource manager
Proceedings of the 2006 ACM/IEEE conference on Supercomputing

Zimbra collaboration suite, Version 4.5
Linux Journal

Entering the petaflop era: the architecture and performance of Roadrunner
Proceedings of the 2008 ACM/IEEE conference on Supercomputing

Nagios: System and Network Monitoring

ZK Step-By-Step: Ajax without JavaScript Framework

Quantified Score

Hi-index	0.00

Abstract

High Performance Computing (HPC) is becoming increasingly popular. The largest supercomputers in the world now have hundreds of thousands of processors and are consequently more prone to software and hardware failures. HPC center managers also have to deal with multiple clusters from different vendors, each with its own architecture. However, because there are not enough HPC experts to manage all the new supercomputers, non-experts are expected to manage these large clusters. In this paper we study the new challenges of managing HPC environments that contain clusters of different sizes and architectures. We review the available tools and present LEMMing [1], an easy-to-use open source tool developed to support high performance computing centers. LEMMing integrates machine resources and the available management and monitoring tools into a single point of management.