Supporting Fault-Tolerant Parallel Programming in Linda
IEEE Transactions on Parallel and Distributed Systems
Global arrays: a portable "shared-memory" programming model for distributed memory computers
Proceedings of the 1994 ACM/IEEE conference on Supercomputing
BOINC: A System for Public-Resource Computing and Storage
GRID '04 Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing
Fault tolerant high performance computing by a coding approach
Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand
ICPP '06 Proceedings of the 2006 International Conference on Parallel Processing
Adaptive and reliable parallel computing on networks of workstations
ATEC '97 Proceedings of the annual conference on USENIX Annual Technical Conference
A Framework for Proactive Fault Tolerance
ARES '08 Proceedings of the 2008 Third International Conference on Availability, Reliability and Security
Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments
CCGRID '08 Proceedings of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid
Scioto: A Framework for Global-View Task Parallelism
ICPP '08 Proceedings of the 2008 37th International Conference on Parallel Processing
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems
ICPP '09 Proceedings of the 2009 International Conference on Parallel Processing
Compiler-support for robust multi-core computing
ISoLA'10 Proceedings of the 4th international conference on Leveraging applications of formal methods, verification, and validation - Volume Part I
X10-FT: Transparent fault tolerance for APGAS language and runtime
Parallel Computing
Hi-index | 0.00 |
We present a fault tolerant task pool execution environment that is capable of performing fine-grain selective restart using a lightweight, distributed task completion tracking mechanism. Compared with conventional checkpoint/restart techniques, this system offers a recovery penalty that is proportional to the degree of failure rather than the system size. We evaluate this system using the Self Consistent Field (SCF) kernel which forms an important component in ab initio methods for computational chemistry. Experimental results indicate that fault tolerant task pools are robust in the presence of an arbitrary number of failures and that they offer low overhead in the absence of faults.