Work stealing is a promising technique for dynamically tolerating variations in the execution environment, including faults, system noise, and energy constraints. In this paper, we present fault tolerance mechanisms for task parallel computations, a popular computation idiom, employing work stealing. The computation is organized as a collection of tasks with data in a global address space. The completion of data operations, rather than the individual messages, is tracked to derive an idempotent data store. This information is also used to accurately identify the tasks to be re-executed in the presence of random work stealing. We consider three recovery schemes with distinct trade-offs: lazy recovery, with potentially increased re-execution cost; immediate collective recovery, with associated synchronization overheads; and noncollective recovery, enabled by additional communication. We employ distributed-memory work stealing to dynamically rebalance the tasks onto the live processes and evaluate the three schemes using candidate application benchmarks. We demonstrate that the space and time overheads of the fault tolerance mechanisms are low, the costs incurred due to failures are small, and the overheads decrease with per-process work at scale.
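To make the core idea concrete, the following is a generic, illustrative sketch (not the paper's implementation) of random work stealing over per-worker deques, combined with an idempotent result store so that re-executing a task after a simulated failure is harmless. All names (`Worker`, `run`, `fail_task`) are invented for this example.

```python
import collections
import random

class Worker:
    """A worker with its own double-ended task queue."""
    def __init__(self, wid):
        self.id = wid
        self.deque = collections.deque()

def run(num_workers, tasks, fail_task=None, seed=0):
    """Simulate random work stealing; optionally fail one task once.

    Results are written to an idempotent store, so re-execution of a
    task that already completed is a no-op, mirroring the idea that
    tracking *completion* of data operations makes recovery safe.
    """
    rng = random.Random(seed)
    workers = [Worker(i) for i in range(num_workers)]
    for i, t in enumerate(tasks):              # round-robin initial placement
        workers[i % num_workers].deque.append(t)
    results = {}                               # idempotent store
    pending = set(tasks)
    while pending:
        w = rng.choice(workers)
        if w.deque:
            task = w.deque.popleft()           # pop own work from the head
        else:
            victims = [v for v in workers if v.deque]
            if not victims:
                continue
            task = rng.choice(victims).deque.pop()  # steal from a victim's tail
        if task == fail_task and task not in results:
            fail_task = None                   # simulated one-time failure:
            w.deque.append(task)               # task is lost, re-enqueue it
            continue
        if task not in results:                # idempotent completion check
            results[task] = task * task        # stand-in for the real work
        pending.discard(task)
    return results
```

A usage example: `run(4, list(range(10)), fail_task=3)` completes all ten tasks despite the injected loss of task 3, because the failed task is simply re-executed and the idempotent store guarantees the duplicate write is harmless.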