We introduce Co-Replication, a technique that exploits abstract properties of a computation to let parallel replicas of a software module cooperate, enhancing both the reliability and the availability of the resulting component and providing a flexible trade-off between the two properties. In Co-Replication, a complete partial ordering is defined on the computation state. The formal expression of the state combination operation among replicas allows them to compute independently as a co-algorithm, and to exploit low-overhead, opportunistic strategies for spreading results and surviving faults. Co-Replication suits structured parallel and component-based programming, as it needs a high-level description of the computation properties, and thus can ease the exploitation of parallel platforms that are not fault-free, such as large clusters and Grids. We describe the theoretical foundations of Co-Replication and investigate the use of random gossiping strategies for the state combination. To show the applicability of the technique, we discuss the modeling of Master-Slave and task farm computations, and report test results for two applications.
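As a rough illustration of the idea, the sketch below models a task farm whose replica state is the set of task results computed so far, ordered by inclusion, with state combination taken to be the union of two states. This is not the paper's implementation: the names `combine` and `run_replicas`, the push-style gossip round, and the crash model are all assumptions made for the example, and the task function is assumed to be deterministic so that merging results is commutative and idempotent.

```python
import random


def combine(a, b):
    """Hypothetical state combination: the join of two replica states.

    A state maps task id -> result. States are ordered by inclusion, so the
    join is the union. Assumes the task function is deterministic, which makes
    the merge commutative and idempotent.
    """
    merged = dict(a)
    merged.update(b)
    return merged


def run_replicas(tasks, work, n_replicas=3, crash_at=None, seed=0):
    """Simulate replicas that compute independently and gossip their state.

    crash_at: round number at which one replica is removed (illustrative
    fault model, an assumption of this sketch). Returns the final states of
    the surviving replicas and the number of rounds executed.
    """
    rng = random.Random(seed)
    states = [dict() for _ in range(n_replicas)]
    alive = [True] * n_replicas
    rounds = 0
    while True:
        live = [i for i in range(n_replicas) if alive[i]]
        if all(len(states[i]) == len(tasks) for i in live):
            break
        rounds += 1
        # Each live replica independently computes one task it has not seen.
        for i in live:
            pending = [t for t in tasks if t not in states[i]]
            if pending:
                t = rng.choice(pending)
                states[i][t] = work(t)
        # Optional fault: one replica crashes and its local state is lost.
        if crash_at is not None and rounds == crash_at and len(live) > 1:
            alive[live[0]] = False
            live = live[1:]
        # Random gossip: each live replica pushes its state to one random peer.
        for i in live:
            peers = [j for j in live if j != i]
            if peers:
                j = rng.choice(peers)
                states[j] = combine(states[j], states[i])
    return [states[i] for i in range(n_replicas) if alive[i]], rounds


if __name__ == "__main__":
    tasks = list(range(8))
    finals, rounds = run_replicas(tasks, lambda t: t * t, crash_at=2)
    print(len(finals), "surviving replicas finished in", rounds, "rounds")
```

Because every live replica can compute any pending task itself, the survivors still reach the complete state after a crash; gossip only shortens the time needed by spreading results that were already computed elsewhere.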