An architectural framework for detecting process hangs/crashes

Authors:
Nithin Nakka;Giacinto Paolo Saggese;Zbigniew Kalbarczyk;Ravishankar K. Iyer
Affiliations:
Center for Reliable and High Performance Computing, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL;Center for Reliable and High Performance Computing, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL;Center for Reliable and High Performance Computing, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL;Center for Reliable and High Performance Computing, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL
Venue:
EDCC'05 Proceedings of the 5th European conference on Dependable Computing
Year:
2005

Citing 13
Cited 7

Reliable computer systems (2nd ed.): design and evaluation

Reliable computer systems (2nd ed.): design and evaluation
PVM: Parallel virtual machine: a users' guide and tutorial for networked parallel computing

PVM: Parallel virtual machine: a users' guide and tutorial for networked parallel computing
Impossibility of distributed consensus with one faulty process

Journal of the ACM (JACM)
Unreliable failure detectors for reliable distributed systems

Journal of the ACM (JACM)
Computer architecture (2nd ed.): a quantitative approach

Computer architecture (2nd ed.): a quantitative approach
Chameleon: A Software Infrastructure for Adaptive Fault Tolerance

IEEE Transactions on Parallel and Distributed Systems
Performance estimation of embedded software with instruction cache modeling

ACM Transactions on Design Automation of Electronic Systems (TODAES)
On the Quality of Service of Failure Detectors

DSN '00 Proceedings of the 2000 International Conference on Dependable Systems and Networks (formerly FTCS-30 and DCCA-8)
Implementation and Performance Evaluation of an Adaptable Failure Detector

DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
Failure Detectors as First Class Objects

DOA '99 Proceedings of the International Symposium on Distributed Objects and Applications
Accelerated Heartbeat Protocols

ICDCS '98 Proceedings of the The 18th International Conference on Distributed Computing Systems
The Effects of an ARMOR-Based SIFT Environment on the Performance and Dependability of User Applications

IEEE Transactions on Software Engineering
Error Sensitivity of the Linux Kernel Executing on PowerPC G4 and Pentium 4 Processors

DSN '04 Proceedings of the 2004 International Conference on Dependable Systems and Networks

A Compiler-Enabled Model- and Measurement-Driven Adaptation Environment for Dependability and Performance

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 10 - Volume 11
Understanding the propagation of hard errors to software and implications for resilient system design

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
mSWAT: low-cost hardware fault detection and diagnosis for multicore systems

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
CUFFS: an instruction count based architectural framework for security of MPSoCs

Proceedings of the Conference on Design, Automation and Test in Europe
Ad hoc synchronization considered harmful

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
High-level synthesis of in-circuit assertions for verification, debugging, and timing analysis

International Journal of Reconfigurable Computing - Special issue on selected papers from the 17th reconfigurable architectures workshop (RAW2010)
CrashTest'ing SWAT: accurate, gate-level evaluation of symptom-based resiliency solutions

DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper addresses the challenges faced in practical implementation of heartbeat-based process/crash and hang detection. We propose an in-processor hardware module to reduce error detection latency and instrumentation overhead. Three hardware techniques integrated into the main pipeline of a superscalar processor are presented. The techniques discussed in this work are: (i) Instruction Count Heartbeat (ICH), which detects process crashes and a class of hangs where the process exists but is not executing any instructions, (ii) Infinite Loop Hang Detector (ILHD), which captures process hangs in infinite execution of legitimate loops, and (iii) Sequential Code Hang Detector (SCHD), which detects process hangs in illegal loops. The proposed design has the following unique features: 1) operating system neutral detection techniques, 2) elimination of any instrumentation for detection of all application crashes and OS hangs, and 3) an automated and light-weight compile-time instrumentation methodology to detect all process hangs (including infinite loops), the detection being performed in the hardware module at runtime. The proposed techniques can support heartbeat protocols to detect operating system/process crashes and hangs in distributed systems. Evaluation of the techniques for hang detection show a low 1.6% performance overhead and 6% memory overhead for the instrumentation. The crash detection technique does not incur any performance overhead and has a latency of a few instructions.