Emerging network technologies rely on complex network interfaces, which has renewed concerns about network reliability. In this paper, we present an effective, low-overhead fault tolerance technique for recovering from network interface failures. Failure detection combines a software watchdog timer, which detects network processor hangs, with a self-testing scheme, which detects interface failures other than processor hangs. The self-testing scheme periodically directs the control flow through only the active software modules, so that errors affecting instructions in the local memory of the network interface are detected. Failure recovery restores the state of the network interface from a small backup copy that holds only the information required for complete recovery. The paper shows how this technique can minimize the performance impact on the host system while remaining completely transparent to the user.
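
The watchdog-plus-backup idea described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's implementation: the names `SimulatedNic`, `Watchdog`, `recover`, and `demo`, the heartbeat counter, the poll limit of 3, and the contents of the backup dictionary are all hypothetical, and the self-testing scheme is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedNic:
    """Toy stand-in for a programmable network interface (hypothetical)."""
    heartbeat: int = 0
    hung: bool = False
    state: dict = field(default_factory=dict)

    def run_once(self) -> None:
        # A healthy network processor bumps the heartbeat on each main-loop pass.
        if not self.hung:
            self.heartbeat += 1


class Watchdog:
    """Reports a hang when the heartbeat stalls for `limit` consecutive polls."""

    def __init__(self, nic: SimulatedNic, limit: int = 3) -> None:
        self.nic = nic
        self.limit = limit
        self.last_seen = nic.heartbeat
        self.stalled_polls = 0

    def poll(self) -> bool:
        if self.nic.heartbeat != self.last_seen:
            # Progress was made since the last poll, so reset the stall count.
            self.last_seen = self.nic.heartbeat
            self.stalled_polls = 0
        else:
            self.stalled_polls += 1
        return self.stalled_polls >= self.limit


def recover(nic: SimulatedNic, backup: dict) -> None:
    """Reset the interface and restore its state from the small backup copy."""
    nic.state = dict(backup)
    nic.heartbeat = 0
    nic.hung = False


def demo():
    nic = SimulatedNic(state={"send_seq": 7, "recv_seq": 4})
    backup = dict(nic.state)  # assumed: just enough state for full recovery
    dog = Watchdog(nic, limit=3)
    detections = []
    for step in range(10):
        if step == 4:
            # Inject a failure: the processor hangs and its state is lost.
            nic.hung = True
            nic.state.clear()
        nic.run_once()
        if dog.poll():
            detections.append(step)
            recover(nic, backup)
            dog = Watchdog(nic, limit=3)  # re-arm the watchdog after recovery
    return detections, nic.state
```

Calling `demo()` injects a hang at step 4 and returns the steps at which the watchdog fired together with the restored state. A real system would drive the polls from a host-side timer rather than a loop counter, and the choice of poll limit trades detection latency against false alarms.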