Executing multithreaded programs efficiently
Executing multithreaded programs efficiently
The implementation of the Cilk-5 multithreaded language
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Transient fault detection via simultaneous multithreading
Proceedings of the 27th annual international symposium on Computer architecture
ATLAS: an infrastructure for global computing
EW 7 Proceedings of the 7th workshop on ACM SIGOPS European workshop: Systems support for worldwide applications
Detailed design and evaluation of redundant multithreading alternatives
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Application-level checkpointing for shared memory programs
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Pin: building customized program analysis tools with dynamic instrumentation
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
X10: an object-oriented approach to non-uniform cluster computing
OOPSLA '05 Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
The N-Version Approach to Fault-Tolerant Software
IEEE Transactions on Software Engineering
PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures
IEEE Transactions on Dependable and Secure Computing
The design of a task parallel library
Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Satin: A high-level and efficient grid programming model
ACM Transactions on Programming Languages and Systems (TOPLAS)
EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
SpiceC: scalable parallelism via implicit copying and explicit commit
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Lifeline-based global load balancing
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Work stealing for multi-core HPC clusters
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
Hi-index | 0.00 |
Fault tolerance and load balancing are critical points for executing long-running parallel applications on multicore clusters. This paper addresses both fault tolerance and load balancing on multicore clusters by presenting a novel work-stealing task scheduling framework which supports hardware fault tolerance. In this framework, both transient and permanent faults are detected and recovered at task granularity. We incorporate task-based fault detection and recovery mechanisms into a hierarchical work-stealing scheme to establish the framework. This framework provides low-overhead fault-tolerance and optimal load balancing by fully exploiting task parallelism.