As the number of processors in today's high performance computers continues to grow, the mean-time-to-failure of these computers is becoming significantly shorter than the execution time of many current high performance computing applications. Although today's architectures are usually robust enough to survive node failures without suffering complete system failure, most of today's high performance computing applications cannot survive node failures and therefore, whenever a node fails, must abort and restart from the beginning or from a stable-storage-based checkpoint. This paper explores the use of the floating-point arithmetic coding approach to build fault-survivable high performance computing applications that can adapt to node failures without aborting. Although the use of erasure codes over Galois fields has been proposed in theory for diskless checkpointing, few actual implementations exist. This probably stems from concerns about both the efficiency and the complexity of implementing such codes in high performance computing applications. In this paper, we introduce a simple but efficient floating-point arithmetic coding approach into diskless checkpointing and address the associated round-off error issue. We also implement a floating-point arithmetic version of the Reed-Solomon coding scheme in a conjugate gradient equation solver and evaluate both the performance and the numerical impact of this scheme. Experimental results demonstrate that the proposed floating-point arithmetic coding approach can survive a small number of simultaneous node failures with low performance overhead and little numerical impact.
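The idea of floating-point checksum encoding can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`encode`, `solve`, `recover`), the Gaussian random weights, and the problem sizes are all assumptions chosen for illustration. Each of `m` checksum vectors is a weighted sum of the `k` data vectors, and up to `m` lost data vectors are rebuilt by solving a small linear system in ordinary floating-point arithmetic, which is where round-off error enters.

```python
import random

def encode(data, weights):
    """Checksum j is the weighted sum c_j = sum_i weights[j][i] * data[i]."""
    n = len(data[0])
    return [[sum(w[i] * data[i][t] for i in range(len(data))) for t in range(n)]
            for w in weights]

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting.
    a is f x f; b is f x n (one right-hand side per vector element)."""
    f = len(a)
    a = [row[:] for row in a]
    b = [row[:] for row in b]
    for col in range(f):
        # Pick the largest pivot in this column to limit round-off growth.
        piv = max(range(col, f), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, f):
            factor = a[r][col] / a[col][col]
            a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
            b[r] = [x - factor * y for x, y in zip(b[r], b[col])]
    x = [None] * f
    for r in reversed(range(f)):
        acc = b[r][:]
        for c in range(r + 1, f):
            acc = [u - a[r][c] * v for u, v in zip(acc, x[c])]
        x[r] = [u / a[r][r] for u in acc]
    return x

def recover(data, checksums, weights, failed):
    """Rebuild the vectors at indices `failed` from survivors and checksums."""
    f = len(failed)
    n = len(checksums[0])
    alive = [i for i in range(len(data)) if i not in failed]
    # Remove the surviving processes' contributions from the first f checksums.
    rhs = [[checksums[j][t] - sum(weights[j][i] * data[i][t] for i in alive)
            for t in range(n)] for j in range(f)]
    # The remaining f x f system involves only the weights of the lost vectors.
    sub = [[weights[j][i] for i in failed] for j in range(f)]
    return dict(zip(failed, solve(sub, rhs)))

# Hypothetical setup: 6 data processes, 2 checksum processes, vectors of length 4.
random.seed(0)
k, m, n = 6, 2, 4
data = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(k)]
weights = [[random.gauss(0, 1) for _ in range(k)] for _ in range(m)]  # assumed weights
checksums = encode(data, weights)

failed = [1, 4]  # two simultaneous "node failures"
damaged = [None if i in failed else v for i, v in enumerate(data)]
rebuilt = recover(damaged, checksums, weights, failed)

# Recovery is approximate, not bit-exact: measure the round-off error.
max_err = max(abs(a - b) for i in failed for a, b in zip(rebuilt[i], data[i]))
print("max recovery error:", max_err)
```

Unlike Galois-field codes, the recovered values here are only accurate up to round-off, and the error grows with the condition number of the small weight submatrix, so the choice of weights matters.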