Reliability challenges in large systems

  • Authors:
  • Daniel A. Reed; Charng-da Lu; Celso L. Mendes

  • Affiliations:
  • Renaissance Computing Institute, University of North Carolina, Chapel Hill, NC; Department of Computer Science, University of Illinois, Urbana, IL; Department of Computer Science, University of Illinois, Urbana, IL

  • Venue:
  • Future Generation Computer Systems
  • Year:
  • 2006

Abstract

Clusters built from commodity PCs dominate high-performance computing today, with systems containing thousands of processors now being deployed. As node counts for multi-teraflop systems grow to tens of thousands, with proposed petaflop systems likely to contain hundreds of thousands of nodes, the assumption of fully reliable hardware and software becomes much less credible. In this paper, after presenting examples and experimental data that quantify the reliability of current systems, we describe possible approaches for effective system use. In particular, we present techniques that detect imminent failures in the environment and that allow an application to run successfully despite such failures. We also show how intelligent and adaptive software can lead to failure resilience and efficient system usage.
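
The scaling argument in the abstract can be illustrated with a minimal sketch. It assumes that node failures are independent and exponentially distributed, which is a textbook simplification and not necessarily the model used in the paper. The function names and the numbers in the usage note are hypothetical, chosen only for illustration; they are not measurements from the paper.

```python
import math


def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """Mean time between failures of the whole system.

    Assumption (not from the paper): nodes fail independently with
    exponentially distributed lifetimes, so failure rates add and the
    system MTBF is the per-node MTBF divided by the node count.
    """
    return node_mtbf_hours / node_count


def survival_probability(node_mtbf_hours: float, node_count: int,
                         run_hours: float) -> float:
    """Probability that no node fails during a run of run_hours.

    Under the same assumption, this is exp(-N * t / MTBF_node).
    """
    return math.exp(-node_count * run_hours / node_mtbf_hours)


if __name__ == "__main__":
    # Illustrative values only: a per-node MTBF of one million hours.
    for nodes in (1_000, 10_000, 100_000):
        mtbf = system_mtbf_hours(1_000_000, nodes)
        p = survival_probability(1_000_000, nodes, run_hours=24)
        print(f"{nodes:>7} nodes: system MTBF {mtbf:8.1f} h, "
              f"P(24 h run with no failure) = {p:.3f}")
```

Under these assumptions, even very reliable individual nodes yield a system that fails within hours once node counts reach the hundreds of thousands, which is why the abstract treats fully reliable hardware and software as an increasingly weak assumption.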