Checkpointing of parallel applications can serve as the core technology for process migration. Both checkpointing and migration are important issues for parallel applications on networks of workstations. The CoCheck environment, which we present in this paper, introduces a new approach to providing checkpointing and migration for parallel applications. Unlike existing systems, CoCheck sits on top of the message-passing library rather than inside it, and achieves consistency at a level above the message-passing system. It uses an existing single-process checkpointer that is available for a wide range of systems. Hence, CoCheck can easily be adapted both to different message-passing systems and to new machines.
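To illustrate what achieving consistency above the message-passing system can look like, the following is a minimal sketch and not CoCheck's actual protocol, which the abstract does not describe. It assumes a marker-based channel flush in the spirit of distributed snapshot algorithms: each process sends a marker on every outgoing channel, drains each incoming channel until the peer's marker arrives, and only then hands its state to a single-process checkpointer. The names `Process`, `send`, `MARKER`, and `coordinated_checkpoint` are hypothetical, and the in-memory queues stand in for a real message-passing library with FIFO channels.

```python
from collections import deque

MARKER = object()  # hypothetical "this channel is now flushed" message


class Process:
    """A toy process that talks to its peers through FIFO channels."""

    def __init__(self, rank, size):
        self.rank = rank
        self.state = 0      # application state: sum of delivered payloads
        self.buffered = []  # in-flight messages captured during a flush
        # inbox[src] models the FIFO channel src -> self inside the library
        self.inbox = {src: deque() for src in range(size) if src != rank}

    def deliver(self, payload):
        """Application-level consumption of one message."""
        self.state += payload


def send(procs, src, dst, msg):
    """Stand-in for the library's send; the wrapper layer uses nothing else."""
    procs[dst].inbox[src].append(msg)


def coordinated_checkpoint(procs):
    """Flush all channels above the library, then checkpoint each process."""
    # Step 1: every process sends a marker on each outgoing channel.
    for p in procs:
        for dst in p.inbox:  # a process's peers are exactly its inbox keys
            send(procs, p.rank, dst, MARKER)

    checkpoints = []
    for p in procs:
        # Step 2: drain each incoming channel up to the peer's marker and
        # keep whatever was still in flight as part of this process's state.
        for src, channel in p.inbox.items():
            while True:
                msg = channel.popleft()
                if msg is MARKER:
                    break
                p.buffered.append((src, msg))
        # Step 3: channels are empty, so a single-process checkpoint is
        # consistent. The dict is a stand-in for a real checkpointer.
        checkpoints.append(
            {"rank": p.rank, "state": p.state, "buffered": list(p.buffered)}
        )
    return checkpoints


def demo():
    procs = [Process(rank, 3) for rank in range(3)]
    procs[0].deliver(1)    # some local progress on rank 0
    send(procs, 0, 1, 5)   # two messages still in flight to rank 1
    send(procs, 2, 1, 7)
    return procs, coordinated_checkpoint(procs)
```

Because the flush only uses ordinary sends and receives, a layer of this kind needs no changes inside the message-passing library, which is the portability argument the abstract makes. On restart, a wrapper of this sort would hand the buffered messages to the application before any newly received ones.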