High performance software coherence for current and future architectures
Journal of Parallel and Distributed Computing - Special issue on distributed shared memory systems
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Hiding communication latency and coherence overhead in software DSMs
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Lightweight logging for lazy release consistent distributed shared memory
OSDI '96 Proceedings of the second USENIX symposium on Operating systems design and implementation
OSDI '96 Proceedings of the second USENIX symposium on Operating systems design and implementation
A Survey of Recoverable Distributed Shared Virtual Memory Systems
IEEE Transactions on Parallel and Distributed Systems
Cashmere-2L: software coherent shared memory on a clustered remote-write network
Proceedings of the sixteenth ACM symposium on Operating systems principles
Reliability Issues in Computing System Design
ACM Computing Surveys (CSUR)
Scalable fault-tolerant distributed shared memory
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Lazy Garbage Collection of Recovery State for Fault-Tolerant Distributed Shared Memory
IEEE Transactions on Parallel and Distributed Systems
The Virtual Interface Architecture
IEEE Micro
Low-Latency, Concurrent Checkpointing for Parallel Programs
IEEE Transactions on Parallel and Distributed Systems
Dynamic Data Replication: An Approach to Providing Fault-Tolerant Shared Memory Clusters
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Coherence-Centric Logging and Recovery for Home-Based Software Distributed Shared Memory
ICPP '99 Proceedings of the 1999 International Conference on Parallel Processing
Reduced Overhead Logging for Rollback Recovery in Distributed Shared Memory
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
An Efficient Logging Scheme for Lazy Release Consistent Distributed Shared Memory Systems
IPPS '98 Proceedings of the 12th. International Parallel Processing Symposium on International Parallel Processing Symposium
An Overview of Checkpointing in Uniprocessor and DistributedSystems, Focusing on Implementation and Performance
Libckpt: transparent checkpointing under Unix
TCON'95 Proceedings of the USENIX 1995 Technical Conference Proceedings
ALS'00 Proceedings of the 4th annual Linux Showcase & Conference - Volume 4
Lightweight logging and recovery for distributed shared memory over virtual interface architecture
ISPDC'03 Proceedings of the Second international conference on Parallel and distributed computing
Accountability and Q-Accountable Logging in Wireless Networks
Wireless Personal Communications: An International Journal
Hi-index | 0.00 |
A common approach to fault-tolerant software DSM is to take checkpoints with message logging. Our remote logging has low overhead because each node saves the coherence-related data into the memory of a remote node through a high-speed system area network. For more lightweight fault-tolerant DSM, in this paper, we mainly focused on eliminating shared memory checkpointing during failure-free execution. Each node independently takes the checkpoints of execution states and non-shared data only. When a node fails, it regenerates its pages from the remote copies in live nodes. In order to efficiently reconstruct pages, we also introduced a XOR-diffing technique. The diff logs, which have been created by XOR operations during failure-free execution, can be applicable to any version of remote copies either backward or forward for recovery. Our scheme reduces the checkpointing overhead and also alleviates the imbalance in execution times among nodes due to independent checkpointing.