The emergence of multicore machines has made exploiting parallelism a necessity for harnessing the abundant computing resources of both single machines and clusters. This, however, may hinder programming productivity, as threaded and distributed programs are hard to write correctly and concurrency and distributed bugs are hard to spot. The asynchronous partitioned global address space (APGAS) model is a programming model that aims to unify programming for multicore machines and clusters while retaining good productivity. Unfortunately, the current implementation of the APGAS programming model lacks support for fault tolerance, and a single transient failure may render hours to months of computation useless. In this paper, we make the first attempt to add fault-tolerance support to APGAS programming models by integrating advances in fault-tolerant distributed systems into an APGAS language called X10. We thoroughly analyze the feasibility of providing fault tolerance for X10. Based on this analysis, we design and implement a fault-tolerance framework called X10-FT. It leverages advances in distributed systems, such as distributed file systems and Paxos, together with solutions specific to the characteristics of the APGAS model, to take checkpoints and reach consensus, which allows machine failures to be handled transparently at different granularities. Using the features of the APGAS model, we extend the X10 compiler to automatically locate execution points at which to checkpoint program state, without intervention from the user. We also provide a preliminary evaluation showing the cost of providing fault tolerance in X10-FT.