X10: an object-oriented approach to non-uniform cluster computing
OOPSLA '05 Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
Concurrent clustered programming
CONCUR 2005 - Concurrency Theory
MapReduce: simplified data processing on large clusters
Communications of the ACM - 50th anniversary issue: 1958 - 2008
Hadoop: The Definitive Guide
Pregel: a system for large-scale graph processing
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
ZooKeeper: wait-free coordination for internet-scale systems
USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
Spark: cluster computing with working sets
HotCloud'10 Proceedings of the 2nd USENIX conference on Hot topics in cloud computing
Lifeline-based global load balancing
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
X10 as a Parallel Language for Scientific Computation: Practice and Experience
IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
Proceedings of the 2011 ACM SIGPLAN X10 Workshop
Proving acceptability properties of relaxed nondeterministic approximate programs
Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
M3R: increased performance for in-memory Hadoop jobs
Proceedings of the VLDB Endowment
Containment domains: a scalable, efficient, and flexible resilience scheme for exascale systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Adoption protocols for fanout-optimal fault-tolerant termination detection
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
X10-FT: transparent fault tolerance for APGAS language and runtime
Proceedings of the 2013 International Workshop on Programming Models and Applications for Multicores and Manycores
Java interoperability in managed X10
Proceedings of the third ACM SIGPLAN X10 Workshop
MillWheel: fault-tolerant stream processing at internet scale
Proceedings of the VLDB Endowment
Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
Hi-index | 0.00 |
Scale-out programs run on multiple processes in a cluster. In scale-out systems, processes can fail. Computations using traditional libraries such as MPI fail when any component process fails. The advent of Map Reduce, Resilient Data Sets and MillWheel has shown dramatic improvements in productivity are possible when a high-level programming framework handles scale-out and resilience automatically. We are concerned with the development of general-purpose languages that support resilient programming. In this paper we show how the X10 language and implementation can be extended to support resilience. In Resilient X10, places may fail asynchronously, causing loss of the data and tasks at the failed place. Failure is exposed through exceptions. We identify a {\em Happens Before Invariance Principle} and require the runtime to automatically repair the global control structure of the program to maintain this principle. We show this reduces much of the burden of resilient programming. The programmer is only responsible for continuing execution with fewer computational resources and the loss of part of the heap, and can do so while taking advantage of domain knowledge. We build a complete implementation of the language, capable of executing benchmark applications on hundreds of nodes. We describe the algorithms required to make the language runtime resilient. We then give three applications, each with a different approach to fault tolerance (replay, decimation, and domain-level checkpointing). These can be executed at scale and survive node failure. We show that for these programs the overhead of resilience is a small fraction of overall runtime by comparing to equivalent non-resilient X10 programs. On one program we show end-to-end performance of Resilient X10 is ~100x faster than Hadoop.