Parallel reduction to Hessenberg form with algorithm-based fault tolerance

Authors:
Yulu Jia; George Bosilca; Piotr Luszczek; Jack J. Dongarra
Affiliations:
University of Tennessee, Knoxville; University of Tennessee, Knoxville; University of Tennessee, Knoxville; University of Tennessee, Knoxville, Oak Ridge National Laboratory and University of Manchester
Venue:
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2013

Abstract

This paper studies the resilience of two-sided factorizations and presents a generic algorithm-based approach for making them resilient. We prove the correctness and numerical stability of the approach in the context of Hessenberg reduction (HR) and present scalability and performance results for a practical implementation. Our method is a hybrid algorithm that combines algorithm-based fault tolerance (ABFT) with diskless checkpointing to fully protect the data. We protect the trailing and the initial parts of the matrix with checksums, and protect finished panels in the panel scope with diskless checkpoints. Compared with the original HR (the ScaLAPACK PDGEHRD routine), our fault-tolerant algorithm introduces very little overhead and maintains the same level of scalability. We prove that the overhead decreases as the size of the matrix or the size of the process grid increases.
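
The checksum idea behind ABFT can be illustrated with a small, single-process sketch. It is not the paper's algorithm: it assumes a dense matrix stored as a list of rows, a single appended checksum column, and only a left-applied Householder reflector, whereas the paper treats the full two-sided, distributed reduction. The helper names (`add_checksum_column`, `householder_apply_left`, `recover_column`) and the sample data are hypothetical. The sketch shows that the checksum relation survives the transformation, so a lost column can be rebuilt from the surviving ones.

```python
def add_checksum_column(A):
    """Append to each row the sum of its entries (hypothetical helper)."""
    return [row + [sum(row)] for row in A]


def householder_apply_left(v, M):
    """Return (I - 2 v v^T / (v^T v)) M for a list-of-rows matrix M."""
    vtv = sum(x * x for x in v)
    nrows, ncols = len(M), len(M[0])
    # w = v^T M, one entry per column of M
    w = [sum(v[i] * M[i][j] for i in range(nrows)) for j in range(ncols)]
    return [[M[i][j] - 2.0 * v[i] * w[j] / vtv for j in range(ncols)]
            for i in range(nrows)]


def recover_column(Ac, lost):
    """Rebuild column `lost` from the checksum column and the other columns."""
    n = len(Ac[0]) - 1  # index of the checksum column
    return [row[n] - sum(row[j] for j in range(n) if j != lost) for row in Ac]


# Sample data (assumption): a 3x3 matrix and an arbitrary reflector vector.
A = [[4.0, 1.0, 2.0],
     [2.0, 3.0, 0.0],
     [1.0, 5.0, 6.0]]
v = [1.0, 2.0, -1.0]

# Transform the matrix together with its checksum column.
Ac = householder_apply_left(v, add_checksum_column(A))

# The checksum relation still holds after the transformation.
checksum_residual = max(abs(row[-1] - sum(row[:-1])) for row in Ac)

# Simulate losing column 1, then rebuild it from the surviving data.
expected = [row[1] for row in Ac]
damaged = [row[:] for row in Ac]
for row in damaged:
    row[1] = float("nan")
recovered = recover_column(damaged, 1)
max_err = max(abs(r - e) for r, e in zip(recovered, expected))
```

Because the reflector is applied to the checksum column along with the data columns, the relation "checksum equals the sum of the data columns" is maintained without recomputing it, which is what makes recovery cheap. Both `checksum_residual` and `max_err` should be at the level of floating-point rounding.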