A study of application-level recovery methods for transient network faults

Authors:
Ignacio Laguna;Edgar A. León;Martin Schulz;Mark Stephenson
Affiliations:
Lawrence Livermore National Laboratory;Lawrence Livermore National Laboratory;Lawrence Livermore National Laboratory;IBM Research Austin
Venue:
ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
Year:
2013

Abstract

As the number of components in HPC systems grows, transient faults will become commonplace. Today, transient network faults, such as lost or corrupted packets, are handled by middleware libraries at the cost of high memory usage and packet retransmissions. These costs can be eliminated with application-level fault tolerance. In this paper, we propose application-level recovery methods for transient network faults that reconstruct missing or corrupted data via interpolation. We derive a realistic fault model using network fault rates from a production HPC cluster and use it to demonstrate the effectiveness of our reconstruction methods in an FFT kernel. We found that the normalized root-mean-square error for FFT computations can be as low as 0.1%, which demonstrates that network faults can be handled at the application level with low perturbation in applications whose computed data is smooth.
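The idea of reconstructing lost data by interpolation and then measuring the perturbation in the transform output can be illustrated with a minimal sketch. It is not the authors' implementation: the linear interpolation scheme, the naive DFT standing in for an FFT kernel, the range-based NRMSE normalization, the test signal, and the set of lost indices are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT kernel)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def reconstruct(samples, lost):
    """Fill lost indices by linear interpolation between the nearest intact
    neighbors (assumed scheme); at the edges, copy the single intact neighbor."""
    lost = set(lost)
    n = len(samples)
    out = list(samples)
    for i in sorted(lost):
        lo = i - 1
        while lo >= 0 and lo in lost:
            lo -= 1
        hi = i + 1
        while hi < n and hi in lost:
            hi += 1
        if lo < 0 and hi >= n:
            raise ValueError("no intact samples to interpolate from")
        if lo < 0:
            out[i] = samples[hi]
        elif hi >= n:
            out[i] = samples[lo]
        else:
            w = (i - lo) / (hi - lo)
            out[i] = (1 - w) * samples[lo] + w * samples[hi]
    return out


def nrmse(ref, approx):
    """RMS error normalized by the range of the reference magnitudes
    (one common convention; the paper's exact definition is not given here)."""
    rmse = math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(ref, approx)) / len(ref))
    mags = [abs(r) for r in ref]
    return rmse / (max(mags) - min(mags))


# Smooth test signal (assumed), with a hypothetical set of lost samples.
n = 64
signal = [math.sin(2 * math.pi * t / n) + 0.5 * math.cos(4 * math.pi * t / n)
          for t in range(n)]
lost = [10, 11, 12, 40]

# Received data: lost samples arrive as zeros before any recovery.
received = [0.0 if t in lost else v for t, v in enumerate(signal)]
recovered = reconstruct(received, lost)

reference = dft(signal)
err_unrecovered = nrmse(reference, dft(received))
err_recovered = nrmse(reference, dft(recovered))
print(f"NRMSE without recovery: {err_unrecovered:.4%}")
print(f"NRMSE with interpolation: {err_recovered:.4%}")
```

On a smooth signal such as this one, interpolation should leave a much smaller error in the transform output than using the faulty data as received. The figures printed here apply only to this toy setup and are not comparable to the results reported in the paper.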