High availability in a real-time system

Authors:
Carlos Almeida; Brad Glade; Keith Marzullo; Robbert van Renesse
Venue:
EW 5 Proceedings of the 5th workshop on ACM SIGOPS European workshop: Models and paradigms for distributed systems structuring
Year:
1992

References:

ARTS: a distributed real-time kernel. ACM SIGOPS Operating Systems Review.

The Spring kernel: a new paradigm for real-time operating systems. ACM SIGOPS Operating Systems Review.

The real-time operating system of MARS. ACM SIGOPS Operating Systems Review.

Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Computing Surveys (CSUR).

The ISIS project: real experience with a fault tolerant programming system. ACM SIGOPS Operating Systems Review.

Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment. Journal of the ACM (JACM).

Priority Inheritance Protocols: An Approach to Real-Time Synchronization. IEEE Transactions on Computers.

Reliable Multicast between Micro-Kernels. Proceedings of the Workshop on Micro-kernels and Other Kernel Architectures.

Abstract

The area of building embedded real-time systems is one in which the applications being designed are more advanced than the available underlying system support. Examples of such applications can be found in several fields, including robot control, avionics, and plant control systems. These systems all have hard real-time requirements: if a deadline is missed, then the result is catastrophic. Furthermore, such deadlines must often be met even in the face of bounded processor or network failures. Yet, the principles for building such systems are still being developed and the availability of systems supporting these principles is very limited.

One of the most important characteristics required by a real-time system is predictability, and predictability can be met in part by ensuring that all timing constraints are met. In order to meet timing constraints, the worst case execution must be computable. Hence, all actions need to be time bounded in order to compute the cost of a given thread, and a scheduling policy must be used that guarantees resource contention does not cause deadlines to be missed [LL73, SRL90].

Several recent research projects have addressed the problem of predictability both in the context of centralized and distributed systems, including ARTS [TM89], RT-Mach [TNR90], MARS [DRSK89], and Spring [SR87, SR89]. These projects are based on real-time scheduling algorithms, and usually also include tools for the off-line development of pre-defined schedules. The issue of predictable operation in the face of crashes and network failures, however, has not been as well addressed.

Failures are masked by using redundancy. For example, in a distributed system the failure of a given process can be masked by replicating the process on several different machines. By doing so, the failure of one replica (caused by the crash of the machine, for example), does not imply a failure in the service: the other replicas can still provide the desired service [Sch90].

Even ignoring predictability, the development of fault-tolerant applications can be a complex task when the programmer does not have supporting software tools. At Cornell, we have developed the ISIS toolkit that supplies a group programming paradigm for building fault-tolerant programs [BJS87, BC91]. However, the current version of this system is not suitable for building real-time programs. ISIS runs on top of Unix and contains no scheduling support for writing predictable real-time applications.

Our goal is to create an environment that supports the development of hard real-time systems even in the face of resource loss. Corto, the system we are building, will support the basic programming abstractions of ISIS; namely, ordered delivery of messages to groups of processes and agreement on membership. Corto will also support the predictable scheduling of processes and communication that systems like ARTS and RT-Mach provide.

We are finding it challenging to integrate these two goals. ISIS supports a model of programming called virtual synchrony in which events such as failure, recovery and message delivery are totally ordered. This abstraction is fundamental to ISIS; because of virtual synchrony, building applications that maintain distributed state in the face of changing resources becomes very straightforward. However, the implementation of virtual synchrony is done by a kind of distributed scheduler which must be made predictable. Hence, implementing Corto is not just running ISIS on top of a real-time kernel.

Our initial approach is to build a suite of basic mechanisms, described below, that support a small set of real-time applications. We are implementing these mechanisms on top of the ISIS transport layer (MUTS [vRBC+92]) running on a stand-alone Unix system with minimal terminal support and our own scheduling. While such a system will not be completely hard real-time, this version will help us refine the right set of mechanisms needed for highly available real-time applications. We will then move the system to a kernel that supports hard real-time scheduling.
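
The abstract's requirement that worst-case execution times be computable, so that a scheduling policy can guarantee deadlines, can be illustrated with the classic utilization-bound test for rate-monotonic scheduling associated with [LL73]. The sketch below is not Corto's scheduler, which the abstract does not describe. It assumes independent periodic tasks whose deadlines equal their periods, and the function names and task values are hypothetical. The test is sufficient but not necessary: a task set that fails it may still be schedulable.

```python
def rm_utilization_bound(n: int) -> float:
    """Return the rate-monotonic utilization bound n * (2**(1/n) - 1)."""
    if n <= 0:
        raise ValueError("n must be a positive number of tasks")
    return n * (2 ** (1.0 / n) - 1)


def rm_schedulable(tasks) -> bool:
    """Sufficient schedulability test for rate-monotonic scheduling.

    tasks is a list of (wcet, period) pairs in the same time unit.
    Assumes independent periodic tasks with deadline == period
    (an assumption of this sketch, not a statement about Corto).
    """
    if not tasks:
        return True  # nothing to schedule
    # Total processor utilization: sum of worst-case cost over period.
    utilization = sum(wcet / period for wcet, period in tasks)
    return utilization <= rm_utilization_bound(len(tasks))


# Hypothetical task sets: (worst-case execution time, period)
light_load = [(1, 4), (1, 5), (2, 10)]   # utilization 0.65
heavy_load = [(2, 4), (2, 5), (2, 10)]   # utilization 1.10

light_ok = rm_schedulable(light_load)
heavy_ok = rm_schedulable(heavy_load)
```

The check only works if every task has a known worst-case cost, which is why unbounded actions make deadline guarantees impossible to compute.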