Testing is sometimes a forgotten component of system management, but it becomes very important for High Performance Computing (HPC) clusters. Many large-scale HPC cluster installations are one of a kind, with unknown issues and unexpected behaviors. First, the initial installation may uncover complex configuration interactions that are apparent only at scale, so Stability is a critical focus of early system testing. Second, Performance may be significantly affected by small changes to the system. Third, after the initial shakeout, users expect a system that is reliable on their terms; ongoing Operational tests verify reliability and provide early warning of developing problems. A robust test suite should address all three of these test categories and present both tests and results in a manner that meets usability requirements. We describe Los Alamos National Laboratory's current test suite and the development project to expand the suite to cover these areas and provide better tools for analysis and reporting.