ACM Transactions on Computer Systems (TOCS)
Chord: A scalable peer-to-peer lookup service for internet applications
Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications
Venti: A New Approach to Archival Storage
FAST '02 Proceedings of the Conference on File and Storage Technologies
How to Build a Highly Available System Using Consensus
WDAG '96 Proceedings of the 10th International Workshop on Distributed Algorithms
An overview of the BlueGene/L Supercomputer
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Supermon: A High-Speed Cluster Monitoring System
CLUSTER '02 Proceedings of the IEEE International Conference on Cluster Computing
DSN '04 Proceedings of the 2004 International Conference on Dependable Systems and Networks
Analysis of microbenchmarks for performance tuning of clusters
CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
P.R.O.S.E.: partitioned reliable operating system environment
ACM SIGOPS Operating Systems Review
Right-weight kernels: an off-the-shelf alternative to custom light-weight kernels
ACM SIGOPS Operating Systems Review
HPC-Colony: services and interfaces for very large systems
ACM SIGOPS Operating Systems Review
MOLAR: adaptive runtime support for high-end computing operating and runtime systems
ACM SIGOPS Operating Systems Review
Introduction to the cell multiprocessor
IBM Journal of Research and Development - POWER5 and packaging
Libra: a library operating system for a jvm in a virtualized execution environment
Proceedings of the 3rd international conference on Virtual execution environments
The Chubby lock service for loosely-coupled distributed systems
OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation
Overview of the IBM Blue Gene/P project
IBM Journal of Research and Development
A unified execution model for cloud computing
ACM SIGOPS Operating Systems Review
Kittyhawk: enabling cooperation and competition in a global, shared computational system
IBM Journal of Research and Development
Within a few short years, we can expect to be dealing with multi-million-thread programs running on million-core systems [16]. This will no doubt stress the contemporary HPC software model, which was developed at a time when 512 cores was a large number. Historical approaches have been further challenged by the increased desire of developers and end users for supercomputer lightweight kernels (LWKs) to support the same environment, libraries, and tools as their desktops. As a result, the emerging workloads of today are far more sophisticated than those of the last two decades, when much of the HPC infrastructure was developed, and feature the use of scripting environments such as Python, dynamic libraries, and complex multi-scale physics frameworks. Complicating this picture is the overwhelming management, monitoring, and reliability problem created by the huge number of nodes in a system of that magnitude. We believe that a re-evaluation and exploration of distributed system principles is called for in order to address the challenges of ultrascale. To that end, we will be evaluating and extending the Plan 9 [21] distributed system on the largest machines available to us, namely the BG/L [28] and BG/P [10] supercomputers. We have chosen Plan 9 based on our previous experiences with it, in combination with previous research [17] which determined that Plan 9 was a "right weight kernel", balancing trade-offs between LWKs and more general-purpose operating systems such as Linux. To deal with issues of scale, we plan to have system services leverage the high-performance interconnects and to explore aggregation as more of a first-class system construct, providing dynamic hierarchical organization and management of all resources.
Our plan is to evaluate the viability of these concepts at scale and to create an alternative development and execution environment that complements the features and capabilities of the existing system software and runtime options. Our intent is to broaden the application base and to make the system as a whole more approachable to a larger class of developers and end users.