Logging and Recovery in Adaptive Software Distributed Shared Memory Systems

Authors:
Angkul Kongmunvattana;Nian-Feng Tzeng
Venue:
SRDS '99 Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems
Year:
1999



Abstract

Software distributed shared memory (DSM) improves the programmability of message-passing machines and workstation clusters by providing a shared memory abstraction (i.e., a coherent global address space) to programmers. As in any distributed system, however, the probability of failure in a software DSM increases as the system size grows. This paper presents a new, efficient logging protocol for adaptive software DSM (ADSM), called adaptive logging (AL). It is suitable for both coordinated and independent checkpointing, since it speeds up the recovery process and eliminates the unbounded rollback problem associated with independent checkpointing. By leveraging the existing coherence data maintained by ADSM, our AL protocol adapts to log only the unrecoverable data (data that cannot be recreated or retrieved after a failure) necessary for correct recovery, reducing both the number of messages logged and the amount of logged data. We have performed experiments on a cluster of eight Sun Ultra-5 workstations, comparing our AL protocol against the previous message logging (ML) protocol by implementing both protocols in a TreadMarks-based ADSM. The experimental results show that our AL protocol consistently outperforms the ML protocol: our protocol increases failure-free execution time by only 2% to 10%, while the ML protocol lengthens execution time many times over because of its larger log size and higher number of logged messages. Our AL-based recovery also outperforms ML-based recovery by 9% to 17% for the parallel applications examined.
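
The contrast the abstract draws between logging every message and logging only unrecoverable data can be illustrated with a toy sketch. This is not the paper's AL protocol or any TreadMarks interface: the `Update` type, the `recoverable` flag, and both policy functions are hypothetical names introduced here, and the sketch simply assumes that the system can already tell which data a surviving node could supply again after a failure.

```python
from dataclasses import dataclass, field


@dataclass
class Update:
    """Hypothetical incoming coherence message (not a TreadMarks type)."""
    page: int
    payload: bytes
    # Assumption: True if the data could be recreated or retrieved after a crash.
    recoverable: bool


@dataclass
class Log:
    entries: list = field(default_factory=list)

    def size_bytes(self) -> int:
        # Total payload bytes held in the log.
        return sum(len(u.payload) for u in self.entries)


def log_all(updates):
    """Baseline in the spirit of message logging: record every update."""
    log = Log()
    log.entries.extend(updates)
    return log


def log_unrecoverable(updates):
    """Selective policy: record only updates that could not be obtained again."""
    log = Log()
    log.entries.extend(u for u in updates if not u.recoverable)
    return log


updates = [
    Update(page=1, payload=b"a" * 64, recoverable=True),
    Update(page=2, payload=b"b" * 32, recoverable=False),
    Update(page=3, payload=b"c" * 128, recoverable=True),
]

full = log_all(updates)
selective = log_unrecoverable(updates)
print(len(full.entries), full.size_bytes())            # 3 224
print(len(selective.entries), selective.size_bytes())  # 1 32
```

In this made-up input the selective policy keeps one of three messages and 32 of 224 bytes; the numbers only illustrate the direction of the saving and say nothing about the paper's measured results.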