This paper studies the resilience of two-sided factorizations and presents a generic algorithm-based approach for making them resilient. We prove the correctness and numerical stability of the approach in the context of Hessenberg reduction (HR) and report scalability and performance results for a practical implementation. Our method is a hybrid algorithm that combines algorithm-based fault tolerance (ABFT) with diskless checkpointing to fully protect the data: checksums protect the trailing and the initial parts of the matrix, and diskless checkpoints protect finished panels in the panel scope. Compared with the original HR (the ScaLAPACK PDGEHRD routine), our fault-tolerant algorithm introduces very little overhead and maintains the same level of scalability. We prove that the overhead follows a decreasing trend as the size of the matrix or the size of the process grid increases.
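To make the checksum idea concrete, the following is a minimal sketch of the general ABFT principle, not the paper's algorithm and not ScaLAPACK code. It assumes a small dense matrix stored as Python lists, a single extra checksum column holding row sums, and a single left-applied Householder reflector. All function names (`append_row_checksum`, `householder`, `recover_column`) are hypothetical and chosen only for illustration. The point it shows is that when the transformation is applied to the data and the checksum together, the checksum stays valid, so a lost column can be rebuilt afterwards.

```python
import math

def append_row_checksum(a):
    """Return a copy of `a` with one extra column holding each row's sum."""
    return [row + [sum(row)] for row in a]

def matmul(q, a):
    """Plain dense product q @ a for lists of lists."""
    inner, cols = len(a), len(a[0])
    return [[sum(q[i][k] * a[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(len(q))]

def householder(v):
    """Build H = I - 2 v v^T / (v^T v) for a nonzero vector v."""
    n = len(v)
    vv = sum(x * x for x in v)
    return [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / vv
             for j in range(n)] for i in range(n)]

def recover_column(a_c, lost):
    """Rebuild data column `lost` in place from the checksum (last) column."""
    for row in a_c:
        others = sum(x for j, x in enumerate(row[:-1]) if j != lost)
        row[lost] = row[-1] - others
    return a_c

# Illustrative data (assumed, not from the paper).
a = [[4.0, 1.0, 2.0],
     [1.0, 3.0, 0.0],
     [2.0, 0.0, 5.0]]
h = householder([1.0, 2.0, 2.0])

# Apply the reflector to data and checksum together: H [A | A e] = [H A | (H A) e].
encoded = matmul(h, append_row_checksum(a))
expected = matmul(h, a)  # fault-free reference result

# Simulate losing column 1 (for example, the process that owned it failed).
for row in encoded:
    row[1] = math.nan

recover_column(encoded, lost=1)
max_err = max(abs(encoded[i][j] - expected[i][j])
              for i in range(3) for j in range(3))
print(max_err < 1e-12)
```

A row-sum checksum of this kind survives only transformations applied from the left. A two-sided factorization such as HR also updates the matrix from the right, which is one reason a single checksum is not enough on its own and the abstract pairs checksums with diskless checkpoints for the finished panels.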