Faults are expected to play an increasingly important role in how algorithms and applications are designed to run on future extreme-scale systems. Algorithm-based fault tolerance (ABFT) is a promising approach that modifies the algorithm itself so that it can recover from faults with lower overhead than replicated storage and with significantly less lost work than checkpoint-restart techniques. Fault-tolerant linear algebra (FTLA) algorithms employ additional processors that store parities along the dimensions of a matrix to tolerate multiple simultaneous faults. Existing approaches assume regular data distributions (blocked or block-cyclic) in which the failures of data blocks are independent. To match the characteristics of failures on parallel computers, we extend these approaches to mapping parity blocks in three ways. First, we handle parity computation for generalized Cartesian data distributions, in which each processor holds an arbitrary subset of the blocks of a Cartesian-distributed array. Second, we present techniques to handle correlated failures, i.e., multiple processors that can be expected to fail together. Third, we handle the colocation of parity blocks with data blocks, so that parity blocks need not reside on additional processors. We present several alternative approaches, based on graph matching, that attempt to balance the memory overhead across processors while guaranteeing the same fault tolerance properties as existing approaches that assume independent failures on regular blocked data distributions. Our evaluation of these algorithms demonstrates that the proposed approach provides these additional properties with minimal overhead.
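
To make the idea of parity along a matrix dimension concrete, the following is a minimal sketch of sum-based parity for one row of data blocks, with recovery of a single lost block. It is an illustration under stated assumptions, not the method evaluated here: the function names `parity` and `recover`, the use of plain element-wise sums, and the toy block contents are all hypothetical, and the sketch ignores data distribution, correlated failures, and parity placement.

```python
# Hypothetical illustration of sum-based parity over one row of blocks.
# Each block is a flat list of numbers; all blocks have the same length.

def parity(blocks):
    """Return the element-wise sum of equally sized blocks."""
    return [sum(values) for values in zip(*blocks)]


def recover(surviving_blocks, parity_block):
    """Rebuild the single missing block from the parity and the survivors."""
    if surviving_blocks:
        partial = parity(surviving_blocks)
    else:
        # No survivors: the parity block equals the lost block.
        partial = [0] * len(parity_block)
    return [p - s for p, s in zip(parity_block, partial)]


# Three data blocks in one block row (assumed toy data).
row = [[1, 2], [3, 4], [5, 6]]
p = parity(row)

# Simulate losing block 1 and rebuilding it from the other two plus parity.
lost = 1
survivors = [b for i, b in enumerate(row) if i != lost]
rebuilt = recover(survivors, p)
print(rebuilt == row[lost])
```

A single sum parity per row tolerates one lost block in that row. Tolerating multiple simultaneous faults requires additional parities, for example along the other matrix dimension, which this sketch does not attempt.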