Reliability challenges in large systems

Authors:
Daniel A. Reed; Charng-da Lu; Celso L. Mendes
Affiliations:
Renaissance Computing Institute, University of North Carolina, Chapel Hill, 27599 NC, USA; Department of Computer Science, University of Illinois, Urbana, 61801 IL, USA; Department of Computer Science, University of Illinois, Urbana, 61801 IL, USA
Venue:
Future Generation Computer Systems
Year:
2006

Abstract

Clusters built from commodity PCs dominate high-performance computing today, with systems containing thousands of processors now being deployed. As node counts for multi-teraflop systems grow to tens of thousands, with proposed petaflop systems likely to contain hundreds of thousands of nodes, the assumption of fully reliable hardware and software becomes much less credible. In this paper, after presenting examples and experimental data that quantify the reliability of current systems, we describe possible approaches for effective system use. In particular, we present techniques for detecting imminent failures in the environment and for allowing an application to run successfully despite such failures. We also show how intelligent and adaptive software can lead to failure resilience and efficient system usage.
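
To make the scaling argument concrete, the following is a minimal sketch of how whole-system reliability might shrink with node count. It assumes independent, exponentially distributed node failures and treats any single node failure as a system failure; this is a textbook simplification, not necessarily the model used in the paper. The function names, the per-node MTBF, and the job length are hypothetical values chosen only for illustration.

```python
import math


def system_mtbf_hours(node_mtbf_hours: float, nodes: int) -> float:
    """System MTBF when any single node failure counts as a system failure.

    Assumes independent, exponentially distributed node failures, so the
    failure rates add and the system MTBF is the node MTBF divided by N.
    """
    return node_mtbf_hours / nodes


def survival_probability(node_mtbf_hours: float, nodes: int, job_hours: float) -> float:
    """Probability that no node fails during a job of the given length."""
    return math.exp(-nodes * job_hours / node_mtbf_hours)


if __name__ == "__main__":
    node_mtbf = 1_000_000.0  # hypothetical per-node MTBF, in hours
    job_length = 24.0        # hypothetical job length, in hours
    for n in (1_000, 10_000, 100_000):
        mtbf = system_mtbf_hours(node_mtbf, n)
        p_ok = survival_probability(node_mtbf, n, job_length)
        print(f"{n:>7} nodes: system MTBF {mtbf:8.1f} h, "
              f"P(no failure in {job_length:.0f} h) = {p_ok:.3f}")
```

Under these assumptions the system MTBF falls in direct proportion to the node count, which is why a per-node reliability that is acceptable for a small cluster may not be for one with hundreds of thousands of nodes.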