X10-FT: transparent fault tolerance for APGAS language and runtime

  • Authors:
  • Chenning Xie; Zhijun Hao; Haibo Chen

  • Affiliations:
  • Shanghai Jiao Tong University; Shanghai Jiao Tong University and Fudan University; Shanghai Jiao Tong University

  • Venue:
  • Proceedings of the 2013 International Workshop on Programming Models and Applications for Multicores and Manycores
  • Year:
  • 2013

Abstract

The emergence of multicore machines has made exploiting parallelism a necessity for harnessing the abundant computing resources of both single machines and clusters. This, however, may hinder programming productivity, as threaded and distributed programming is hard to get right and concurrency and distributed bugs are hard to spot. The asynchronous partitioned global address space (APGAS) model is a programming model that aims to unify programming for multicore machines and clusters with good productivity. Unfortunately, the current implementation of the APGAS programming model lacks support for fault tolerance, and a single transient failure may render hours to months of computation useless. In this paper, we make the first attempt to add fault-tolerance support to APGAS programming models by integrating advances in fault-tolerant distributed systems into an APGAS language called X10. We thoroughly analyze the feasibility of providing fault tolerance for X10. Based on this analysis, we design and implement a fault-tolerance framework called X10-FT. It leverages advances in distributed systems, such as distributed file systems and PAXOS, together with specific solutions based on the characteristics of the APGAS model, to take checkpoints and reach consensus, which allows machine failures to be handled transparently at different granularities. Using the features of the APGAS model, we extend the X10 compiler to automatically locate execution points at which to checkpoint program state, without intervention from the user. We also provide a preliminary evaluation that shows the cost of providing fault tolerance in X10-FT.