IEEE Transactions on Computers
Xception: A Technique for the Experimental Evaluation of Dependability in Modern Computers
IEEE Transactions on Software Engineering
Experimental assessment of parallel systems
FTCS '96: Proceedings of the Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing
Assessing Fault Sensitivity in MPI Applications
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Case-based software reliability assessment by fault injection unified procedures
Proceedings of the 2008 International Workshop on Software Engineering in East and South Europe
Injecting communication faults to experimentally validate Java distributed applications
ISSADS'05 Proceedings of the 5th international conference on Advanced Distributed Systems
Abstract: This paper addresses the problem of injecting faults into the communication system of disjoint-memory parallel computers and presents fault injection results showing that 5% to 30% of the faults injected into the communication subsystem of a commercial parallel computer caused undetected errors that led the application to generate erroneous results. In all these cases it would have been virtually impossible to detect that the benchmark output was erroneous, as the size of the results file was plausible and no system errors had been detected. This emphasizes the need for fault-tolerant techniques in parallel systems in order to achieve confidence in the application results. This is especially true in massively parallel computers, as the probability of a fault occurring increases with the number of processing nodes. Moreover, in disjoint-memory computers, the most popular and scalable parallel architecture, the communication subsystem plays an important role and is also very prone to errors. CSFI (Communication Software Fault Injector) is a versatile tool for injecting communication faults in parallel computers. Faults injected with CSFI directly emulate, in software, communication faults and spurious messages generated by non-fail-silent nodes, allowing the evaluation of the impact of faults in parallel systems and the assessment of fault-tolerant techniques. The use of CSFI is nearly transparent to the target application, as it requires only minor adaptations. Deterministic faults of different natures can be injected without user intervention, and fault injection results are collected automatically by CSFI.
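The general idea of software-based communication fault injection can be illustrated with a minimal Python sketch. It is not CSFI's implementation, which the abstract does not detail. All names here (`FaultSpec`, `FaultInjectingChannel`, `flip_bit`) are hypothetical, and the wrapped `send` callable stands in for whatever message-passing primitive a target application would use. The sketch assumes faults are triggered deterministically by message count and that every injection is logged automatically.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class FaultSpec:
    """Hypothetical fault description: what to inject and when."""
    kind: str        # "drop", "corrupt", or "duplicate"
    trigger_at: int  # inject on the N-th message sent (1-based)
    bit: int = 0     # bit index to flip, used only by "corrupt"


def flip_bit(payload: bytes, bit: int) -> bytes:
    """Return a copy of payload with one bit inverted."""
    if not payload:
        return payload
    data = bytearray(payload)
    bit %= len(data) * 8  # keep the index inside the message
    data[bit // 8] ^= 1 << (bit % 8)
    return bytes(data)


class FaultInjectingChannel:
    """Wraps a send function and applies deterministic faults to it."""

    def __init__(self, send: Callable[[int, bytes], None],
                 faults: List[FaultSpec]) -> None:
        self._send = send
        self._faults: Dict[int, FaultSpec] = {f.trigger_at: f for f in faults}
        self._count = 0
        self.log: List[Tuple[int, str]] = []  # (message number, fault kind)

    def send(self, dest: int, payload: bytes) -> None:
        self._count += 1
        fault = self._faults.get(self._count)
        if fault is None:
            self._send(dest, payload)  # fault-free path: pass through
            return
        self.log.append((self._count, fault.kind))
        if fault.kind == "drop":
            return  # message is silently lost
        if fault.kind == "corrupt":
            self._send(dest, flip_bit(payload, fault.bit))
        elif fault.kind == "duplicate":
            self._send(dest, payload)
            self._send(dest, payload)  # second copy acts as a spurious message
        else:
            raise ValueError(f"unknown fault kind: {fault.kind}")


if __name__ == "__main__":
    delivered: List[Tuple[int, bytes]] = []
    channel = FaultInjectingChannel(
        lambda dest, payload: delivered.append((dest, payload)),
        [FaultSpec("corrupt", trigger_at=2, bit=0), FaultSpec("drop", trigger_at=3)],
    )
    for i in range(4):
        channel.send(1, bytes([i]))
    print(delivered)
    print(channel.log)
```

Because the application only swaps its send call for the wrapper's, the change to the target is small, which mirrors the "nearly transparent" property the abstract attributes to CSFI. A corrupted message delivered this way raises no system error, which is the kind of fault that can lead to plausible-looking but wrong output.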