SAFER: System-level Architecture for Failure Evasion in Real-time Applications

  • Authors:
  • Junsung Kim;Gaurav Bhatia;Ragunathan (Raj) Rajkumar;Markus Jochim

  • Affiliations:
  • -;-;-;-

  • Venue:
  • RTSS '12 Proceedings of the 2012 IEEE 33rd Real-Time Systems Symposium
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

Recent trends towards increasing complexity in distributed embedded real-time systems pose challenges in designing and implementing a reliable system such as a self-driving car. The conventional way of improving reliability is to use redundant hardware to replicate the whole (sub)system. Although hardware replication has been widely deployed in hard real-time systems such as avionics, space shuttles and nuclear power plants, it is significantly less attractive to many applications because the amount of necessary hardware multiplies as the size of the system increases. The growing needs of flexible system design are also not consistent with hardware replication techniques. To address the needs of dependability through redundancy operating in real-time, we propose a layer called SAFER(System-level Architecture for Failure Evasion in Real-time applications) to incorporate configurable task-level fault-tolerance features to tolerate fail-stop processor and task failures for distributed embedded real-time systems. To detect such failures, SAFER monitors the health status and state information of each task and broadcasts the information. When a failure is detected using either time-based failure detection or event-based failure detection, SAFER reconfigures the system to retain the functionality of the whole system. We provide a formal analysis of the worst-case timing behaviors of SAFER features. We also describe the modeling of a system equipped with SAFER to analyze timing characteristics through a model-based design tool called SysWeaver. SAFER has been implemented on Ubuntu 10.04 LTS and deployed on Boss, an award-winning autonomous vehicle developed at Carnegie Mellon University. We show various measurements using simulation scenarios used during the 2007 DARPA Urban Challenge. Finally, we present a case study of failure recovery by SAFER when node failures are injected.