Reliability challenges in large systems

Authors:
Daniel A. Reed;Charng-da Lu;Celso L. Mendes
Affiliations:
Renaissance Computing Institute, University of North Carolina, Chapel Hill, NC;Department of Computer Science, University of Illinois, Urbana, IL;Department of Computer Science, University of Illinois, Urbana, IL
Venue:
Future Generation Computer Systems
Year:
2006

Abstract

Clusters built from commodity PCs dominate high-performance computing today, with systems containing thousands of processors now being deployed. As node counts for multi-teraflop systems grow to tens of thousands, with proposed petaflop systems likely to contain hundreds of thousands of nodes, the assumption of fully reliable hardware and software becomes much less credible. In this paper, after presenting examples and experimental data that quantify the reliability of current systems, we describe possible approaches for effective system use. In particular, we present techniques for detecting imminent failures in the environment and for allowing an application to run successfully despite such failures. We also show how intelligent and adaptive software can lead to failure resilience and efficient system usage.
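
The scaling argument in the abstract can be illustrated with a back-of-the-envelope sketch. It is not taken from the paper: it assumes that node failures are independent, that node lifetimes are exponentially distributed, and that the loss of any single node interrupts the application. The per-node MTBF of one million hours and the function names are hypothetical values chosen only for illustration. Under these assumptions the system MTBF is the node MTBF divided by the node count, and the probability that a run of length t completes without any failure is exp(-t * N / MTBF_node).

```python
import math


def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """Mean time between failures of a system that fails when any node fails.

    Assumes independent, exponentially distributed node lifetimes
    (an illustrative assumption, not a claim from the paper).
    """
    if node_mtbf_hours <= 0 or node_count <= 0:
        raise ValueError("node_mtbf_hours and node_count must be positive")
    return node_mtbf_hours / node_count


def survival_probability(node_mtbf_hours: float, node_count: int,
                         run_hours: float) -> float:
    """Probability that no node fails during a run of run_hours hours."""
    if run_hours < 0:
        raise ValueError("run_hours must be non-negative")
    return math.exp(-run_hours / system_mtbf_hours(node_mtbf_hours, node_count))


if __name__ == "__main__":
    NODE_MTBF_HOURS = 1_000_000.0  # hypothetical per-node MTBF
    RUN_HOURS = 24.0               # hypothetical job length
    for nodes in (1_000, 10_000, 100_000):
        mtbf = system_mtbf_hours(NODE_MTBF_HOURS, nodes)
        p_ok = survival_probability(NODE_MTBF_HOURS, nodes, RUN_HOURS)
        print(f"{nodes:>7} nodes: system MTBF {mtbf:8.1f} h, "
              f"P(24 h run survives) = {p_ok:.3f}")
```

Even with very reliable individual nodes, this simple model shows the system-level MTBF shrinking in proportion to the node count, which is why failure detection and application-level tolerance matter at this scale.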