Tolerating node failures in cache only memory architectures

  • Authors:
  • A. Gefflaut;C. Morin;M. Banâtre

  • Affiliations:
  • Campus Universitaire de Beaulieu, 35042 Rennes Cedex - France;Campus Universitaire de Beaulieu, 35042 Rennes Cedex - France;Campus Universitaire de Beaulieu, 35042 Rennes Cedex - France

  • Venue:
  • Proceedings of the 1994 ACM/IEEE conference on Supercomputing
  • Year:
  • 1994

Abstract

COMAs (Cache Only Memory Architectures) are an interesting class of large-scale shared memory multiprocessors. They extend the concepts of cache memories and shared virtual memory by using the local memories of the nodes as large caches for a single shared address space. Because of their large number of components, these architectures are particularly susceptible to hardware failures, so fault tolerance mechanisms have to be introduced to ensure high availability. In this paper, we propose an implementation of backward error recovery in a COMA that minimizes performance degradation and requires few hardware modifications. This implementation uses the features of a COMA to implement a stable storage abstraction using the standard memories of the architecture. Recovery data are replicated and mixed with current data in node memories; both kinds of data are managed transparently by an extended coherence protocol.
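The general idea of backward error recovery with replicated recovery data can be illustrated with a toy model. The sketch below is not the protocol from the paper: the names `Node`, `ToyCluster`, `checkpoint`, `fail`, and `rollback` are hypothetical, placement of copies by `addr % number_of_live_nodes` is an arbitrary assumption, and the checkpoint is not atomic as a real implementation would need it to be. It only shows that when each recovery copy lives in the ordinary memories of two distinct nodes, the loss of a single node still leaves enough data to roll back.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node's memory holding both current data and recovery data."""
    node_id: int
    current: dict = field(default_factory=dict)   # working copies: addr -> value
    recovery: dict = field(default_factory=dict)  # checkpointed copies: addr -> value
    alive: bool = True


class ToyCluster:
    """Hypothetical model: recovery data replicated in two node memories."""

    def __init__(self, n_nodes: int):
        if n_nodes < 2:
            raise ValueError("need at least two nodes to replicate recovery data")
        self.nodes = [Node(i) for i in range(n_nodes)]

    def _live(self):
        return [n for n in self.nodes if n.alive]

    def write(self, addr: int, value):
        live = self._live()
        # Keep a single current copy: drop any old one, then store the new value.
        for n in live:
            n.current.pop(addr, None)
        live[addr % len(live)].current[addr] = value

    def read(self, addr: int):
        for n in self._live():
            if addr in n.current:
                return n.current[addr]
        raise KeyError(addr)

    def checkpoint(self):
        """Establish a recovery point: two copies of every item on distinct nodes."""
        live = self._live()
        if len(live) < 2:
            raise RuntimeError("cannot replicate recovery data on a single node")
        snapshot = {}
        for n in live:
            snapshot.update(n.current)
        for n in live:
            n.recovery.clear()
        for addr, value in snapshot.items():
            first = addr % len(live)
            second = (first + 1) % len(live)  # differs from first since len(live) >= 2
            live[first].recovery[addr] = value
            live[second].recovery[addr] = value

    def fail(self, node_id: int):
        """Simulate a node failure: the whole memory content of the node is lost."""
        node = self.nodes[node_id]
        node.alive = False
        node.current.clear()
        node.recovery.clear()

    def rollback(self):
        """Discard current data and restore the last recovery point."""
        live = self._live()
        restored = {}
        for n in live:
            restored.update(n.recovery)
            n.current.clear()
        for addr, value in restored.items():
            live[addr % len(live)].current[addr] = value


cluster = ToyCluster(3)
cluster.write(0, "a")
cluster.write(1, "b")
cluster.checkpoint()
cluster.write(0, "x")   # modification made after the recovery point
cluster.fail(0)         # node 0 loses its current data and its recovery copies
cluster.rollback()      # surviving replicas restore the checkpointed state
```

In this toy run, the value written to address 0 after the checkpoint is discarded by the rollback, and both addresses read back their checkpointed values even though node 0 held one of the recovery copies.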