Clusters built from commodity PCs dominate high-performance computing today, with systems containing thousands of processors now being deployed. As node counts for multi-teraflop systems grow to tens of thousands, and proposed petaflop systems are likely to contain hundreds of thousands of nodes, the assumption of fully reliable hardware and software becomes much less credible. In this paper, after presenting examples and experimental data that quantify the reliability of current systems, we describe possible approaches for effective system use. In particular, we present techniques for detecting imminent failures in the environment and for allowing an application to run successfully despite such failures. We also show how intelligent and adaptive software can lead to failure resilience and efficient system usage.
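To illustrate why full reliability becomes less credible as node counts grow, the following is a minimal sketch and not the paper's own model. It assumes that node failures are independent and exponentially distributed, and that any single node failure interrupts the application. The function names and the per-node MTBF of one million hours are hypothetical, chosen only for illustration.

```python
import math


def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """Mean time between failures of the whole system.

    Assumption (not from the paper): node failures are independent and
    exponentially distributed, and one node failure counts as a system failure.
    Under that assumption the failure rates add, so the MTBF divides by N.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return node_mtbf_hours / node_count


def prob_no_failure(node_mtbf_hours: float, node_count: int, run_hours: float) -> float:
    """Probability that a run of run_hours sees no node failure at all."""
    return math.exp(-run_hours / system_mtbf_hours(node_mtbf_hours, node_count))


if __name__ == "__main__":
    NODE_MTBF_HOURS = 1_000_000  # hypothetical per-node figure, for illustration only
    for nodes in (1_000, 10_000, 100_000):
        mtbf = system_mtbf_hours(NODE_MTBF_HOURS, nodes)
        survive = prob_no_failure(NODE_MTBF_HOURS, nodes, run_hours=24)
        print(f"{nodes:>7} nodes: system MTBF {mtbf:8.1f} h, "
              f"P(24 h run without failure) = {survive:.3f}")
```

Under these assumptions the system-level MTBF shrinks in proportion to the node count, which is the scaling argument behind treating failures as routine events to be detected and tolerated rather than as exceptions.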