Fault oblivious eXascale whitepaper

  • Authors: Ronald G. Minnich; Curtis L. Janssen; Sriram Krishnamoorthy; Andres Marquez; Maya Gokhale; P. Sadayappan; Eric Van Hensbergen; Jim McKie; Jonathan Appavoo
  • Affiliations: Sandia National Laboratories; Sandia National Laboratories; Pacific Northwest National Laboratory; Pacific Northwest National Laboratory; Lawrence Livermore National Laboratory; Ohio State University; IBM Research; Alcatel-Lucent Bell-Labs; Boston University
  • Venue: Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers
  • Year: 2011

Abstract

Exascale computing systems will provide a thousand-fold increase in parallelism and a proportional increase in failure rate relative to today's machines [3]. Future systems are expected to feature billions of threads and tens of millions of CPUs. The nodes and networks of these systems will be hierarchical, and ignoring this hardware hierarchy will lead to poor utilization. Failure will be a constant companion, and it is unlikely that checkpointing the entire system, with its petabytes of memory, will be practical. Systems software for exascale machines must provide the infrastructure to support existing applications while simultaneously enabling efficient execution of new programming models that naturally express dynamic, adaptive, irregular computation; coupled simulations; and massive data analysis.