Tolerating correlated failures for generalized Cartesian distributions via bipartite matching

Authors:
Nawab Ali;Sriram Krishnamoorthy;Mahantesh Halappanavar;Jeff Daily
Affiliations:
Pacific Northwest National Laboratory, Richland, WA;Pacific Northwest National Laboratory, Richland, WA;Pacific Northwest National Laboratory, Richland, WA;Pacific Northwest National Laboratory, Richland, WA
Venue:
Proceedings of the 8th ACM International Conference on Computing Frontiers
Year:
2011



Abstract

Faults are expected to play an increasingly important role in how algorithms and applications are designed to run on future extreme-scale systems. Algorithm-based fault tolerance (ABFT) is a promising approach that modifies the algorithm itself to recover from faults, with lower overhead than replicated storage and significantly less lost work than checkpoint-restart techniques. Fault-tolerant linear algebra (FTLA) algorithms employ additional processors that store parities along the dimensions of a matrix to tolerate multiple simultaneous faults. Existing approaches assume regular data distributions (blocked or block-cyclic) in which the failures of individual data blocks are independent. To match the characteristics of failures on parallel computers, we extend these approaches to mapping parity blocks in three ways. First, we handle parity computation for generalized Cartesian data distributions, in which each processor holds an arbitrary subset of the blocks of a Cartesian-distributed array. Second, we present techniques to handle correlated failures, i.e., multiple processors that can be expected to fail together. Third, we allow parity blocks to be colocated with data blocks rather than requiring additional processors. We present several alternative approaches, based on graph matching, that attempt to balance the memory overhead across processors while guaranteeing the same fault tolerance properties as existing approaches that assume independent failures on regular blocked data distributions. Our evaluation of these algorithms shows that the proposed approach provides these additional properties with minimal overhead.
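
The abstract does not spell out the matching formulation, so the following is only an illustrative sketch of how parity placement might be posed as a bipartite matching problem, not the authors' algorithm. It assumes one parity block per block-row, a hypothetical `owner` grid mapping each data block to a processor, a hypothetical `group` map assigning each processor to a failure group (processors in one group fail together), and a per-processor `capacity` that bounds how many parity blocks a processor may hold as a crude stand-in for memory balancing. A row's parity may be placed only on a processor whose failure group holds none of that row's data blocks. The matching itself uses the standard augmenting-path method.

```python
from typing import Dict, List, Optional, Set, Tuple


def place_row_parities(
    owner: List[List[int]],
    group: Dict[int, int],
    capacity: int = 1,
) -> Optional[Dict[int, int]]:
    """Return {row: processor holding that row's parity}, or None if infeasible.

    owner[r][c] is the processor holding data block (r, c) (hypothetical input).
    group[p] is the failure group of processor p (hypothetical input).
    """
    procs = sorted(group)

    # A processor is eligible for row r only if no processor in its failure
    # group holds a data block of row r.
    eligible: List[List[int]] = []
    for row in owner:
        bad_groups = {group[p] for p in row}
        eligible.append([p for p in procs if group[p] not in bad_groups])

    # Each processor offers `capacity` slots; slot_row maps a slot to its row.
    slot_row: Dict[Tuple[int, int], int] = {}

    def augment(r: int, seen: Set[Tuple[int, int]]) -> bool:
        """Try to give row r a slot, re-homing other rows if needed."""
        for p in eligible[r]:
            for k in range(capacity):
                slot = (p, k)
                if slot in seen:
                    continue
                seen.add(slot)
                if slot not in slot_row or augment(slot_row[slot], seen):
                    slot_row[slot] = r
                    return True
        return False

    for r in range(len(owner)):
        if not augment(r, set()):
            return None  # some row has no admissible parity location
    return {r: p for (p, _k), r in slot_row.items()}


# Six processors in three failure groups; three block-rows of two blocks each.
example_group = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
example_owner = [[0, 1], [2, 3], [0, 2]]
example_placement = place_row_parities(example_owner, example_group)
print(example_placement)  # {0: 2, 1: 0, 2: 4}
```

In this toy instance, parity blocks are colocated with data blocks on processors 2 and 0, and no failure group holds both a row's data and its parity. A fuller treatment would also cover column parities and weighted balancing, which this sketch leaves out.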