Chameleon: A Software Infrastructure for Adaptive Fault Tolerance

Authors:
Zbigniew T. Kalbarczyk;Ravishankar K. Iyer;Saurabh Bagchi;Keith Whisnant
Affiliations:
Univ. of Illinois at Urbana-Champaign, Urbana;Univ. of Illinois at Urbana-Champaign, Urbana;Univ. of Illinois at Urbana-Champaign, Urbana;Univ. of Illinois at Urbana-Champaign, Urbana
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
1999

Abstract

This paper presents Chameleon, an adaptive infrastructure that allows different levels of availability requirements to be supported simultaneously in a networked environment. Chameleon provides dependability through special ARMORs (Adaptive, Reconfigurable, and Mobile Objects for Reliability) that control all operations in the Chameleon environment. Three broad classes of ARMORs are defined: 1) Managers oversee other ARMORs and recover from failures in their subordinates. 2) Daemons provide communication gateways to the ARMORs at the host node and make the host's resources available to the Chameleon environment. 3) Common ARMORs implement specific techniques for providing application-required dependability. Employing ARMORs, Chameleon makes different fault-tolerant configurations available and adapts at run time to changes in an application's availability requirements. The flexible ARMOR architecture allows the composition of ARMORs to be reconfigured at run time, i.e., the ARMORs may dynamically adapt to changing application requirements. In this paper, we describe the ARMOR architecture, including the ARMOR class hierarchy, basic building blocks, ARMOR composition, and the use of ARMOR factories. We show how ARMORs can be reconfigured and reengineered and demonstrate how the architecture serves our objective of providing an adaptive software infrastructure. To our knowledge, Chameleon is one of the few real implementations that enables multiple fault tolerance strategies to exist in the same environment and supports fault-tolerant execution of substantially off-the-shelf applications via a software infrastructure only. Chameleon provides fault tolerance from the application's point of view as well as from the software infrastructure's point of view.
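The three ARMOR classes named above can be pictured with a minimal sketch. It is an illustration only, not Chameleon's actual interface: the class names follow the abstract, but every field and method (`alive`, `restart`, `recover_failed`, `local_armors`, `subordinates`, `technique`) is a hypothetical stand-in.

```python
from dataclasses import dataclass, field


@dataclass
class Armor:
    """Base class for all ARMORs (hypothetical interface, not from the paper)."""
    name: str
    alive: bool = True

    def restart(self) -> None:
        # Stand-in for real recovery, such as reinstalling the ARMOR on a node.
        self.alive = True


@dataclass
class CommonArmor(Armor):
    """Implements one dependability technique, e.g. a voter or a heartbeat."""
    technique: str = "heartbeat"


@dataclass
class Daemon(Armor):
    """Per-host gateway for the ARMORs that run on its node."""
    host: str = "localhost"
    local_armors: list = field(default_factory=list)


@dataclass
class Manager(Armor):
    """Oversees subordinate ARMORs and recovers the ones that failed."""
    subordinates: list = field(default_factory=list)

    def recover_failed(self) -> list:
        recovered = []
        for armor in self.subordinates:
            if not armor.alive:
                armor.restart()
                recovered.append(armor.name)
        return recovered


# Usage: a manager notices a failed subordinate and restarts it.
voter = CommonArmor("voter", technique="majority voting")
daemon = Daemon("daemon-a", host="node-a", local_armors=[voter])
manager = Manager("manager", subordinates=[daemon, voter])
voter.alive = False
print(manager.recover_failed())  # ['voter']
```

Failure detection is reduced here to a boolean flag; the abstract does not say how Managers detect failures, so that part is left out of the sketch.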
To demonstrate Chameleon's capabilities, we have implemented a prototype infrastructure that provides a set of ARMORs to initialize the environment and to support the dual and TMR (triple modular redundancy) application execution modes. Through this testbed environment, we measure the execution overhead and the recovery times from failures in the user application, the Chameleon ARMORs, the hardware, and the operating system.
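As a rough illustration of the dual and TMR execution modes, the sketch below runs replicas of a computation and takes a strict-majority vote over their results. It shows the general technique under simple assumptions (in-process replicas, hashable results); the function names `vote` and `run_replicated` are hypothetical and do not describe how Chameleon's ARMORs implement these modes.

```python
from collections import Counter
from typing import Callable, Sequence


def vote(results: Sequence[int]) -> int:
    """Return the value produced by a strict majority of replicas."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among replica results")
    return value


def run_replicated(replicas: Sequence[Callable[[int], int]], arg: int) -> int:
    """Run every replica on the same input and vote on the outputs."""
    return vote([replica(arg) for replica in replicas])


def square(x: int) -> int:
    return x * x


def faulty_square(x: int) -> int:
    # Simulates a replica that returns a corrupted value.
    return x * x + 1


# TMR: three replicas, one faulty; the majority masks the error.
print(run_replicated([square, square, faulty_square], 4))  # 16

# Dual mode: two replicas can detect a mismatch but cannot mask it.
try:
    run_replicated([square, faulty_square], 4)
except RuntimeError as err:
    print("dual mode detected a mismatch:", err)
```

With three replicas a single wrong result is outvoted, while with two replicas a disagreement can only be detected and must be handled by some other recovery step.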