Global Computing platforms, large-scale clusters and future TeraGRID systems gather thousands of nodes for computing parallel scientific applications. At this scale, node failures and disconnections are frequent events. This volatility reduces the MTBF (mean time between failures) of the whole system to the range of hours or minutes.

We present MPICH-V, an automatic volatility-tolerant MPI environment based on uncoordinated checkpoint/rollback and distributed message logging. The MPICH-V architecture relies on Channel Memories, Checkpoint Servers and theoretically proven protocols to execute existing or new SPMD and Master-Worker MPI applications on volatile nodes.

To evaluate its capabilities, we run MPICH-V within a framework in which the number of nodes, Channel Memories and Checkpoint Servers, as well as the node volatility, can be completely configured. We present a detailed performance evaluation of every component of MPICH-V and of its global performance for non-trivial parallel applications. Experimental results demonstrate good scalability and high tolerance to node volatility.
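
The combination of uncoordinated checkpointing and message logging described above can be illustrated with a small, single-process Python sketch. It is not MPICH-V code and does not reproduce its protocols. The names `ChannelMemory`, `Worker`, and `demo` are hypothetical, and the sketch assumes the message log lives on a reliable store that survives the worker's crash. It shows only the core idea: a node that restarts from its own checkpoint can replay logged messages in their original order and reach the state it would have reached without the failure, with no other node having to roll back.

```python
class ChannelMemory:
    """Toy reliable store (hypothetical) logging messages for one receiver."""

    def __init__(self):
        self._log = []  # messages kept in the order they were accepted

    def put(self, sender, payload):
        # Senders deposit messages here instead of sending to the receiver.
        self._log.append((sender, payload))

    def get(self, index):
        # Return the index-th logged message, or None if it does not exist yet.
        return self._log[index] if index < len(self._log) else None


class Worker:
    """Toy receiver whose application state is a running sum of payloads."""

    def __init__(self, channel):
        self.channel = channel
        self.total = 0      # application state
        self.delivered = 0  # number of logged messages already consumed

    def step(self):
        # Consume the next message in log order; False means nothing to read.
        msg = self.channel.get(self.delivered)
        if msg is None:
            return False
        _sender, payload = msg
        self.total += payload
        self.delivered += 1
        return True

    def checkpoint(self):
        # Uncoordinated: the worker snapshots itself without asking its peers.
        return {"total": self.total, "delivered": self.delivered}

    @classmethod
    def restart(cls, channel, snapshot):
        # Roll back to the snapshot; later messages are replayed from the log.
        worker = cls(channel)
        worker.total = snapshot["total"]
        worker.delivered = snapshot["delivered"]
        return worker


def demo():
    cm = ChannelMemory()
    for i, value in enumerate([1, 2, 3, 4, 5]):
        cm.put(sender=i % 2, payload=value)

    worker = Worker(cm)
    worker.step()
    worker.step()
    snapshot = worker.checkpoint()  # taken after two deliveries
    worker.step()                   # this progress is lost in the crash
    del worker                      # simulated node failure

    recovered = Worker.restart(cm, snapshot)
    while recovered.step():         # replay messages 3..5 from the log
        pass
    return recovered.total, recovered.delivered
```

Calling `demo()` returns the same total and delivery count that a failure-free run over the five logged messages would produce. The replay works because the log fixes the delivery order, so the recovered worker sees the third message again exactly as it did before the crash.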