MOLAR: adaptive runtime support for high-end computing operating and runtime systems

Authors:
Christian Engelmann;Stephen L. Scott;David E. Bernholdt;Narasimha R. Gottumukkala;Chokchai Leangsuksun;Jyothish Varma;Chao Wang;Frank Mueller;Aniruddha G. Shet;P. Sadayappan
Affiliations:
Oak Ridge National Laboratory, Oak Ridge, TN;Oak Ridge National Laboratory, Oak Ridge, TN;Oak Ridge National Laboratory, Oak Ridge, TN;Louisiana Tech University, Ruston, LA;Louisiana Tech University, Ruston, LA;North Carolina State University, Raleigh, NC;North Carolina State University, Raleigh, NC;North Carolina State University, Raleigh, NC;The Ohio State University, Columbus, OH;The Ohio State University, Columbus, OH
Venue:
ACM SIGOPS Operating Systems Review
Year:
2006

Abstract

MOLAR is a multi-institutional research effort that concentrates on adaptive, reliable, and efficient operating and runtime system (OS/R) solutions for ultra-scale high-end scientific computing on the next generation of supercomputers. This research addresses the challenges outlined in the FAST-OS (Forum to Address Scalable Technology for Runtime and Operating Systems) and HECRTF (High-End Computing Revitalization Task Force) activities in two ways. First, it explores the use of advanced monitoring and adaptation to improve application performance and the predictability of system interruptions. Second, it advances computer reliability, availability, and serviceability (RAS) management systems so that they work cooperatively with the OS/R to identify and preemptively resolve system issues. This paper describes recent research of the MOLAR team in advancing RAS for high-end computing OS/Rs.