Holistic aggregate resource environment

  • Authors:
  • Eric Van Hensbergen;Charles Forsyth;Jim McKie;Ron Minnich

  • Venue:
  • ACM SIGOPS Operating Systems Review
  • Year:
  • 2008

Abstract

Within a few short years, we can expect to be dealing with multi-million-thread programs running on million-core systems [16]. This will no doubt stress the contemporary HPC software model, which was developed at a time when 512 cores was a large number. Historical approaches have been further challenged by the increased desire of developers and end users for supercomputer lightweight kernels (LWKs) to support the same environment, libraries, and tools as their desktops. As a result, the emerging workloads of today are far more sophisticated than those of the last two decades, when much of the HPC infrastructure was developed, and feature scripting environments such as Python, dynamic libraries, and complex multi-scale physics frameworks. Complicating this picture is the overwhelming management, monitoring, and reliability problem created by the huge number of nodes in a system of that magnitude. We believe that a re-evaluation and exploration of distributed system principles is called for in order to address the challenges of ultrascale. To that end, we will be evaluating and extending the Plan 9 [21] distributed system on the largest machines available to us, namely the BG/L [28] and BG/P [10] supercomputers. We have chosen Plan 9 based on our previous experiences with it, in combination with previous research [17] which determined that Plan 9 was a "right weight kernel", balancing trade-offs between LWKs and more general-purpose operating systems such as Linux. To deal with issues of scale, we plan to have system services leverage the high-performance interconnects and to explore aggregation as more of a first-class system construct, providing dynamic hierarchical organization and management of all resources.
Our plan is to evaluate the viability of these concepts at scale and to create an alternative development and execution environment that complements the features and capabilities of the existing system software and runtime options. Our intent is to broaden the application base and to make the system as a whole more approachable to a larger class of developers and end users.