Fault Recovery Mechanism for Multiprocessor Servers

Authors:
Yoshio Masubuchi; Satoshi Hoshina; Tomofumi Shimada; Hideaki Hirayama; Nobuhiro Kato
Venue:
Proceedings of the 27th International Symposium on Fault-Tolerant Computing (FTCS '97)
Year:
1997

Abstract

Achieving higher reliability in open server systems at low cost has attracted increasing interest. To meet this demand, we propose a new fault recovery mechanism. We extended the recovery cache scheme to suit state-of-the-art multiprocessor server systems and built a system-level fault recovery mechanism on top of it. The mechanism enables the system to recover from most intermittent hardware errors without rebooting. Furthermore, faulty processors can be isolated dynamically, and the mechanism can recover not only from hardware errors but also from many operating system panics caused by unanticipated software errors. The fault recovery mechanism is implemented as an "add-on" hardware module and a controlling software module, and it is fully transparent to application programs. Thus no modification to the basic hardware is required and binary compatibility is maintained, which is mandatory for open systems. System performance was evaluated using the TPC-C benchmark. We also built an experimental system with prototype hardware.
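The abstract does not describe the hardware in detail, so the following is only a minimal software sketch of the general recovery-cache idea it builds on: save the old value of a location before its first modification after a checkpoint, and restore those values if a fault is detected. All names here (`RecoveryCacheMemory`, `run_with_recovery`, the use of `RuntimeError` to stand in for a detected fault) are hypothetical and are not taken from the paper; the authors' mechanism is an add-on hardware module plus controlling software, not Python.

```python
class RecoveryCacheMemory:
    """Toy memory that keeps before-images of cells modified since the last checkpoint."""

    def __init__(self, size):
        self.cells = [0] * size
        self.undo_log = {}  # address -> value the cell held at the last checkpoint

    def write(self, addr, value):
        # Record the before-image only on the first write since the checkpoint.
        if addr not in self.undo_log:
            self.undo_log[addr] = self.cells[addr]
        self.cells[addr] = value

    def read(self, addr):
        return self.cells[addr]

    def checkpoint(self):
        # Commit the current state: the before-images are no longer needed.
        self.undo_log.clear()

    def rollback(self):
        # Restore every modified cell to its value at the last checkpoint.
        for addr, old_value in self.undo_log.items():
            self.cells[addr] = old_value
        self.undo_log.clear()


def run_with_recovery(mem, step, max_retries=3):
    """Run `step` from the last checkpoint, rolling back and retrying on a fault.

    A RuntimeError raised by `step` stands in for a detected intermittent fault
    (an assumption of this sketch). Returns the number of retries that were needed.
    """
    for attempt in range(max_retries + 1):
        try:
            step(mem)
            mem.checkpoint()
            return attempt
        except RuntimeError:
            mem.rollback()
    raise RuntimeError("fault persisted after retries")
```

In this toy model, a step that fails once after partially updating memory is retried from the checkpointed state, so the partial updates never become visible as committed state. A fault that persists across all retries is reported instead of being retried forever, which loosely mirrors the distinction the abstract draws between intermittent errors and faults that require isolating a processor.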