A major barrier to research and development on resilient HPC applications is the lack of reliability and performance data from extreme-scale computing distributions. To understand how and why highly scaled HPC applications encounter increasingly frequent performance interruptions, one must conduct extensive trend analysis on contemporary machines and the programs they run. However, existing HPC application log files are labyrinthine documents that, even with the assistance of intelligent data mining algorithms, remain difficult for humans to interpret. In addition, conventional log filtering is limited to post-mortem, reactive use, because the enormous size of these logs prevents efficient real-time interaction. Thus, the HPC field has a strong need for accurate yet concise real-time application information. Moreover, the means of reporting this data must be sufficiently lightweight and non-intrusive to attach discreetly to the multiple processes running on multiple cores within tens (or in some cases, hundreds) of thousands of compute nodes. Furthermore, this information should in turn be used to facilitate the autonomous correction of application-threatening faults, suspensions, and interruptions. This paper describes a dynamic application instrumentation module (utilizing a combination of Open|SpeedShop software and custom scripting) aimed at achieving these goals.