Resilient X10: efficient failure-aware programming

Authors:
David Cunningham;David Grove;Benjamin Herta;Arun Iyengar;Kiyokuni Kawachiya;Hiroki Murata;Vijay Saraswat;Mikio Takeuchi;Olivier Tardieu
Affiliations:
Google, Inc, New York, NY, USA;IBM T.J. Watson Research Center, Yorktown, NY, USA;IBM T.J. Watson Research Center, Yorktown, NY, USA;IBM T.J. Watson Research Center, Yorktown, NY, USA;IBM Research - Tokyo, Tokyo, Japan;IBM Research - Tokyo, Tokyo, Japan;IBM T.J. Watson Research Center, Yorktown, NY, USA;IBM Research - Tokyo, Tokyo, Japan;IBM T.J. Watson Research Center, Yorktown, NY, USA
Venue:
Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
Year:
2014

Abstract

Scale-out programs run on multiple processes in a cluster. In scale-out systems, processes can fail, and computations that use traditional libraries such as MPI fail when any component process fails. The advent of MapReduce, Resilient Data Sets, and MillWheel has shown that dramatic improvements in productivity are possible when a high-level programming framework handles scale-out and resilience automatically. We are concerned with the development of general-purpose languages that support resilient programming. In this paper we show how the X10 language and implementation can be extended to support resilience. In Resilient X10, places may fail asynchronously, causing loss of the data and tasks at the failed place. Failure is exposed through exceptions. We identify a Happens Before Invariance Principle and require the runtime to automatically repair the global control structure of the program to maintain this principle. We show that this removes much of the burden of resilient programming. The programmer is responsible only for continuing execution with fewer computational resources and the loss of part of the heap, and can do so while taking advantage of domain knowledge. We build a complete implementation of the language, capable of executing benchmark applications on hundreds of nodes. We describe the algorithms required to make the language runtime resilient. We then give three applications, each with a different approach to fault tolerance (replay, decimation, and domain-level checkpointing). These can be executed at scale and survive node failure. By comparing against equivalent non-resilient X10 programs, we show that for these programs the overhead of resilience is a small fraction of overall runtime. On one program, we show that the end-to-end performance of Resilient X10 is approximately 100x faster than Hadoop.
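To make the programming model in the abstract more concrete, the following is a minimal sequential sketch, written in Python rather than X10, of how a program might react when a place failure surfaces as an exception. It illustrates two of the named approaches, replay and decimation. It is an analogy under stated assumptions, not the paper's implementation: `DeadPlaceError`, `run_at`, and `resilient_sum` are hypothetical names introduced here, places are simulated by integers, and a failure is simulated by membership in a `dead` set.

```python
class DeadPlaceError(Exception):
    """Hypothetical stand-in for the exception raised when a place has failed."""

    def __init__(self, place):
        super().__init__(f"place {place} is dead")
        self.place = place


def run_at(place, dead, chunk):
    """Simulate running a task at `place`; the task is lost if the place is dead."""
    if place in dead:
        raise DeadPlaceError(place)
    return sum(chunk)


def resilient_sum(chunks, dead, strategy):
    """Sum one chunk per place, tolerating failures by 'replay' or 'decimation'."""
    live = [p for p in range(len(chunks)) if p not in dead]
    if not live:
        raise RuntimeError("no surviving places")
    total = 0
    for place, chunk in enumerate(chunks):
        try:
            total += run_at(place, dead, chunk)
        except DeadPlaceError:
            if strategy == "replay":
                # Re-execute the lost task on a surviving place.
                # Assumes the task's input is still available elsewhere.
                total += run_at(live[place % len(live)], dead, chunk)
            elif strategy == "decimation":
                # Accept the loss and continue with the surviving results.
                continue
            else:
                raise
    return total


chunks = [[1, 2], [3, 4], [5, 6]]
print(resilient_sum(chunks, dead={1}, strategy="replay"))      # 21
print(resilient_sum(chunks, dead={1}, strategy="decimation"))  # 14
```

With replay the result matches the failure-free answer at the cost of redoing work, whereas decimation returns an approximate answer computed from the surviving places only. Which trade-off is acceptable depends on the application, which is the kind of domain knowledge the abstract leaves to the programmer.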