Supporting component-based failover units in middleware for distributed real-time and embedded systems

  • Authors:
  • Friedhelm Wolf, Jaiganesh Balasubramanian, Sumant Tambe, Aniruddha Gokhale, Douglas C. Schmidt

  • Affiliations:
  • Department of EECS, Vanderbilt University, Nashville, TN 37235, USA (all authors)

  • Venue:
  • Journal of Systems Architecture: the EUROMICRO Journal
  • Year:
  • 2011

Abstract

Although component middleware is increasingly used to develop distributed, real-time and embedded (DRE) systems, it poses new fault-tolerance challenges, such as the need for efficient synchronization of internal component state, failure correlation across groups of components, and configuration of fault-tolerance properties at the component granularity level. This paper makes three contributions to R&D on component-based fault-tolerance. First, it describes the COmponent Replication based on Failover Units (CORFU) component middleware, which provides fail-stop behavior and fault correlation across groups of components treated as an atomic unit in DRE systems. Second, it describes how CORFU's Components with HEterogeneous State Synchronization (CHESS) module provides mechanisms for real-time aware state transfer and synchronization. Third, it empirically evaluates the client failover and group shutdown capabilities of CORFU and its CHESS module and compares them with existing object-oriented fault-tolerance methods. The results show that component middleware (1) has acceptable fault-tolerance performance for DRE systems, (2) allows timely recovery while considering the failure location, size, and functional topology of the group, and (3) eases the burden of application development by providing middleware support for fault-tolerance at the component level.
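
The failover-unit idea in the abstract (a group of components that fails and is replaced as one atomic unit) can be illustrated with a minimal sketch. This is not CORFU's API: the names `Component`, `FailoverUnit`, `report_failure`, and `select_unit` are hypothetical, and the sketch leaves out state synchronization, real-time constraints, and distribution. It assumes only that a failure in any member shuts down the whole group and that clients then fail over to the next replica group.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical stand-in for a deployed component instance."""
    name: str
    alive: bool = True

    def shutdown(self) -> None:
        self.alive = False


@dataclass
class FailoverUnit:
    """Hypothetical group of components treated as one atomic unit."""
    name: str
    components: list = field(default_factory=list)

    @property
    def alive(self) -> bool:
        # The unit is usable only while every member is alive.
        return all(c.alive for c in self.components)

    def report_failure(self, failed: Component) -> None:
        # Fail-stop at group granularity: one failed member
        # causes every member of the unit to be shut down.
        failed.alive = False
        for c in self.components:
            c.shutdown()


def select_unit(replicas):
    """Return the first fully alive unit in priority order, else None."""
    for unit in replicas:
        if unit.alive:
            return unit
    return None


def demo():
    primary = FailoverUnit("primary", [Component("sensor"), Component("planner")])
    backup = FailoverUnit("backup", [Component("sensor"), Component("planner")])
    replicas = [primary, backup]

    before = select_unit(replicas).name
    primary.report_failure(primary.components[0])  # a single member fails
    after = select_unit(replicas).name

    return before, after, [c.alive for c in primary.components]
```

Calling `demo()` shows a client-side selection moving from the primary unit to the backup after one primary member fails, with the remaining primary member shut down as well rather than left running in a partially failed group.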