MOLAR is a multi-institutional research effort that concentrates on adaptive, reliable, and efficient operating and runtime system (OS/R) solutions for ultra-scale high-end scientific computing on the next generation of supercomputers. The research addresses challenges outlined by the FAST-OS (Forum to Address Scalable Technology for Runtime and Operating Systems) and HECRTF (High-End Computing Revitalization Task Force) activities in two ways. First, it explores advanced monitoring and adaptation to improve application performance and the predictability of system interruptions. Second, it advances computer reliability, availability, and serviceability (RAS) management systems so that they work cooperatively with the OS/R to identify and preemptively resolve system issues. This paper describes the MOLAR team's recent research in advancing RAS for high-end computing OS/Rs.