Reli: hardware/software checkpoint and recovery scheme for embedded processors

Authors:
Tuo Li;Roshan Ragel;Sri Parameswaran
Affiliations:
University of New South Wales, Sydney, Australia;University of Peradeniya, Peradeniya, Sri Lanka;University of New South Wales, Sydney, Australia
Venue:
DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe
Year:
2012

Abstract

Checkpoint and Recovery (CR) allows computer systems to operate correctly even when compromised by transient faults. Many software and hardware CR systems exist, but they are usually too large, too slow, or require major modifications to the software or extensive modifications to the caching schemes. In this paper, we propose a novel error-recovery management scheme based on re-engineering the instruction set. We take the native instruction set of the processor and enhance the microinstructions with additional micro-operations that enable checkpointing. The recovery mechanism is implemented by three custom instructions, which restore the registers, the data memory values, and the special registers (PC, status registers, etc.) that were changed. The checkpointing storage is adapted to the benchmark executed. Results show that our method degrades performance by just 1.45% under fault-free conditions and incurs an area overhead of 45% on average and 79% in the worst case. In the examples we examined, recovery takes at most 62 clock cycles.
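
The checkpoint-and-rollback idea summarized above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design from the paper: the class `ToyCore`, its methods, the eight-register file, and the undo-log layout are all hypothetical names chosen for illustration. Each write first saves the old value (standing in for the extra checkpointing micro-operations), and three separate routines stand in for the three custom recovery instructions.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class ToyCore:
    """Toy processor state with an undo log (hypothetical model, not the paper's hardware)."""
    regs: list = field(default_factory=lambda: [0] * 8)
    mem: dict = field(default_factory=dict)
    pc: int = 0
    status: int = 0
    # Checkpoint storage: old values of everything changed since the last checkpoint.
    reg_log: dict = field(default_factory=dict)
    mem_log: dict = field(default_factory=dict)
    special_log: Optional[Tuple[int, int]] = None

    def checkpoint(self) -> None:
        """Open a new checkpoint interval: clear the logs and save PC and status."""
        self.reg_log.clear()
        self.mem_log.clear()
        self.special_log = (self.pc, self.status)

    def write_reg(self, idx: int, value: int) -> None:
        # Assumed checkpointing micro-operation: log the old value on the first write only.
        self.reg_log.setdefault(idx, self.regs[idx])
        self.regs[idx] = value
        self.pc += 1

    def write_mem(self, addr: int, value: int) -> None:
        # None in the log marks an address that held no value at the checkpoint.
        self.mem_log.setdefault(addr, self.mem.get(addr))
        self.mem[addr] = value
        self.pc += 1

    def recover_regs(self) -> None:
        """Restore the general registers that were changed."""
        for idx, old in self.reg_log.items():
            self.regs[idx] = old
        self.reg_log.clear()

    def recover_mem(self) -> None:
        """Restore the data memory values that were changed."""
        for addr, old in self.mem_log.items():
            if old is None:
                self.mem.pop(addr, None)
            else:
                self.mem[addr] = old
        self.mem_log.clear()

    def recover_special(self) -> None:
        """Restore the special registers (PC and status) saved at the checkpoint."""
        if self.special_log is not None:
            self.pc, self.status = self.special_log

    def recover(self) -> None:
        """Roll back to the last checkpoint using the three recovery steps."""
        self.recover_regs()
        self.recover_mem()
        self.recover_special()


if __name__ == "__main__":
    core = ToyCore()
    core.write_reg(1, 5)
    core.checkpoint()
    core.write_reg(1, 99)      # state change after the checkpoint
    core.recover()             # simulated fault detected: roll back
    print(core.regs[1], core.pc)
```

Logging only the first write to each location keeps the log proportional to the number of distinct locations changed in an interval. That is a design choice of this sketch and should not be read as a claim about how the paper sizes its checkpoint storage.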