Clusters built from commodity PCs dominate high-performance computing today, with systems containing thousands of processors now being deployed. As node counts for multi-teraflop systems grow to tens of thousands, and proposed petaflop systems are likely to contain hundreds of thousands of nodes, the assumption of fully reliable hardware and software becomes much less credible. In this paper, after presenting examples and experimental data that quantify the reliability of current systems, we describe possible approaches for effective system use. In particular, we present techniques that detect imminent failures in the environment and that allow an application to run successfully despite such failures. We also show how intelligent and adaptive software can lead to failure resilience and efficient system usage.