X10-FT: transparent fault tolerance for APGAS language and runtime

Authors:
Chenning Xie;Zhijun Hao;Haibo Chen
Affiliations:
Shanghai Jiao Tong University;Shanghai Jiao Tong University and Fudan University;Shanghai Jiao Tong University
Venue:
Proceedings of the 2013 International Workshop on Programming Models and Applications for Multicores and Manycores
Year:
2013

Abstract

The emergence of multicore machines has made exploiting parallelism a necessity for harnessing the abundant computing resources of both single machines and clusters. This, however, may hinder programming productivity, as threaded and distributed programs are hard to write correctly and concurrency and distributed bugs are hard to spot. The asynchronous partitioned global address space (APGAS) model is a programming model that aims to unify programming for multicore machines and clusters while preserving good productivity. Unfortunately, the current implementation of the APGAS programming model lacks support for fault tolerance, and a single transient failure may render hours to months of computation useless. In this paper, we make the first attempt to add fault-tolerance support to APGAS programming models by integrating advances in fault-tolerant distributed systems into an APGAS language called X10. We thoroughly analyze the feasibility of providing fault tolerance for X10. Based on this analysis, we design and implement a fault-tolerance framework called X10-FT. It leverages advances in distributed systems, such as distributed file systems and Paxos, together with solutions specific to the characteristics of the APGAS model, to take checkpoints and reach consensus, which allows machine failures to be handled transparently at different granularities. Using the features of the APGAS model, we extend the X10 compiler to automatically locate execution points at which to checkpoint program state, without intervention from the user. We also provide a preliminary evaluation that shows the cost of providing fault tolerance in X10-FT.
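
The general checkpoint-and-restart idea mentioned in the abstract can be illustrated with a minimal sketch. This is not X10-FT's implementation and is not written in X10: it is a single-process Python illustration under stated assumptions. The names `CheckpointStore`, `SimulatedFailure`, `run`, and `run_with_recovery` are hypothetical, an in-memory object stands in for reliable storage such as a distributed file system, and the end of each loop step plays the role of a point at which program state is consistent and safe to save.

```python
import copy


class CheckpointStore:
    """Hypothetical in-memory stand-in for reliable (e.g. distributed) storage."""

    def __init__(self):
        self._saved = None

    def save(self, step, state):
        # Deep-copy so later mutation of the live state cannot alter the checkpoint.
        self._saved = (step, copy.deepcopy(state))

    def load(self):
        if self._saved is None:
            return None
        step, state = self._saved
        return step, copy.deepcopy(state)


class SimulatedFailure(Exception):
    """Stands in for a machine failure (assumption made for illustration)."""


def run(total_steps, store, fail_at=None):
    """Sum 0..total_steps-1, checkpointing after every completed step."""
    restored = store.load()
    step, state = restored if restored is not None else (0, {"sum": 0})
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise SimulatedFailure(f"failure before step {step}")
        state["sum"] += step
        step += 1
        # The step is complete, so the state is consistent and can be saved.
        store.save(step, state)
    return state["sum"]


def run_with_recovery(total_steps, fail_at):
    """Run once with an injected failure, then resume from the last checkpoint."""
    store = CheckpointStore()
    try:
        return run(total_steps, store, fail_at=fail_at)
    except SimulatedFailure:
        # Restart without injecting another failure; completed steps are not redone.
        return run(total_steps, store)
```

In this sketch the caller sees the same result whether or not a failure occurred, which is the sense in which recovery is transparent to the user. A real system would additionally need to agree on which checkpoint is the latest valid one across machines, which is where a consensus protocol such as Paxos comes in.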