Poster: FOX: a fault-oblivious extreme scale execution environment

  • Authors:
  • Ronald G. Minnich; Curtis L. Janssen; Sriram Krishnamoorthy; Andres Marquez; Wenjing Ma; Maya Gokhale; Ponnuswamy Sadayappan; Eric Van Hensbergen; Jonathan Appavoo; Jim McKie

  • Affiliations:
  • Sandia National Laboratories, Livermore, USA; Sandia National Laboratories, Livermore, USA; Pacific Northwest National Laboratory, Richland, USA; Pacific Northwest National Laboratory, Richland, USA; Pacific Northwest National Laboratory, Richland, USA; Lawrence Livermore National Laboratory, Livermore, USA; Ohio State University, Columbus, USA; IBM, Austin, USA; Boston University, Boston, USA; Alcatel-Lucent, Murray Hill, USA

  • Venue:
  • Proceedings of the 2011 companion on High Performance Computing Networking, Storage and Analysis Companion
  • Year:
  • 2011

Abstract

Exascale computing systems will provide a thousand-fold increase in parallelism, and a proportional increase in failure rate, relative to today's machines. Systems software for exascale machines must provide the infrastructure to support existing applications while also enabling efficient execution of new programming models that naturally express dynamic, adaptive, irregular computation; coupled simulations; and massive data analysis, all in a highly unreliable hardware environment with billions of threads of execution. Further, these systems must be designed with failure in mind. FOX is a new system for exascale machines that will support distributed data objects as first-class objects in the operating system itself. This memory-based data store will be named and accessed as part of the application's file system name space. We can build many types of objects on this data store, including data-driven work queues, which will in turn support applications with inherent resilience.
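
To make the idea of a work queue that lives in a file system name space more concrete, the following is a minimal sketch and not FOX's actual interface, which the abstract does not describe. It assumes an ordinary local directory stands in for the memory-based distributed data store, and that an atomic rename is available to claim a task. The class name `FileWorkQueue`, the `pending`/`claimed`/`done` directory layout, and the worker names are all hypothetical. The sketch shows one way such a queue could tolerate worker failure: tasks held by a failed worker are simply renamed back into the pending set.

```python
import os
import tempfile


class FileWorkQueue:
    """Toy work queue whose entire state is a directory tree (hypothetical layout)."""

    def __init__(self, root):
        self.pending = os.path.join(root, "pending")
        self.claimed = os.path.join(root, "claimed")
        self.done = os.path.join(root, "done")
        for d in (self.pending, self.claimed, self.done):
            os.makedirs(d, exist_ok=True)

    def put(self, task_id, payload):
        # Write under a hidden temporary name, then rename, so that a
        # worker listing the directory never sees a half-written task.
        tmp = os.path.join(self.pending, "." + task_id + ".tmp")
        with open(tmp, "w") as f:
            f.write(payload)
        os.rename(tmp, os.path.join(self.pending, task_id))

    def claim(self, worker):
        # Claim the first pending task by renaming it into this worker's
        # directory. If another worker renamed it first, try the next one.
        worker_dir = os.path.join(self.claimed, worker)
        os.makedirs(worker_dir, exist_ok=True)
        for name in sorted(os.listdir(self.pending)):
            if name.startswith("."):
                continue  # skip in-progress writes
            try:
                os.rename(os.path.join(self.pending, name),
                          os.path.join(worker_dir, name))
            except FileNotFoundError:
                continue
            return name
        return None

    def complete(self, worker, task_id):
        os.rename(os.path.join(self.claimed, worker, task_id),
                  os.path.join(self.done, task_id))

    def recover(self, worker):
        # Return every task a failed worker was holding to the pending set.
        worker_dir = os.path.join(self.claimed, worker)
        if not os.path.isdir(worker_dir):
            return []
        names = sorted(os.listdir(worker_dir))
        for name in names:
            os.rename(os.path.join(worker_dir, name),
                      os.path.join(self.pending, name))
        return names


def demo():
    with tempfile.TemporaryDirectory() as root:
        q = FileWorkQueue(root)
        for i in range(3):
            q.put(f"task{i}", f"payload {i}")

        finished = []
        first = q.claim("w1")
        q.complete("w1", first)
        finished.append(first)

        q.claim("w1")               # w1 takes a second task, then "fails"
        requeued = q.recover("w1")  # its unfinished work goes back to pending

        # A surviving worker drains the queue, including the requeued task.
        while (task := q.claim("w2")) is not None:
            q.complete("w2", task)
            finished.append(task)
        return sorted(finished), requeued


if __name__ == "__main__":
    print(demo())
```

In this sketch the queue holds no state outside the name space, so any process that can see the directory tree can inspect or repair it. A real exascale design would also need failure detection and replication of the store itself, which are outside the scope of this illustration.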