Assessing the effects of communication faults on parallel applications

  • Venue: IPDS '95: Proceedings of the International Computer Performance and Dependability Symposium
  • Year: 1995

Abstract

This paper addresses the problem of injecting faults into the communication system of disjoint-memory parallel computers. It presents fault injection results showing that 5% to 30% of the faults injected into the communication subsystem of a commercial parallel computer caused undetected errors that led the application to generate erroneous results. All these cases correspond to situations in which it would be virtually impossible to detect that the benchmark output was erroneous, as the size of the results file was plausible and no system errors had been detected. This emphasizes the need for fault-tolerant techniques in parallel systems in order to achieve confidence in the application results. This is especially true in massively parallel computers, as the probability of faults occurring increases with the number of processing nodes. Moreover, in disjoint-memory computers, the most popular and scalable parallel architecture, the communication subsystem plays an important role and is also very prone to errors. CSFI (Communication Software Fault Injector) is a versatile tool for injecting communication faults in parallel computers. Faults injected with CSFI directly emulate, in software, communication faults and spurious messages generated by non-fail-silent nodes, allowing the evaluation of the impact of faults in parallel systems and the assessment of fault-tolerant techniques. The use of CSFI is nearly transparent to the target application, as it requires only minor adaptations. Deterministic faults of different natures can be injected without user intervention, and fault injection results are collected automatically by CSFI.
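
The general idea of software-implemented communication fault injection can be illustrated with a small Python sketch. This is not CSFI's implementation, which the abstract does not describe in detail; the names `FaultSpec`, `FaultyChannel`, and `run_sum`, the fault kinds, and the toy summing application are all assumptions made for illustration. The sketch wraps a message-delivery function, injects one deterministic fault at a chosen message index, and logs what was injected.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class FaultSpec:
    """Hypothetical fault description (not CSFI's actual format)."""
    kind: str         # "none", "drop", "corrupt", or "duplicate"
    trigger: int = 0  # 0-based index of the message to hit
    bit: int = 0      # bit flipped in the first payload byte for "corrupt"


@dataclass
class FaultyChannel:
    """Wraps a delivery function and injects one deterministic fault."""
    deliver: Callable[[bytes], None]
    fault: FaultSpec
    sent: int = 0
    log: List[str] = field(default_factory=list)

    def send(self, payload: bytes) -> None:
        index = self.sent
        self.sent += 1
        # Messages other than the targeted one pass through untouched.
        if self.fault.kind == "none" or index != self.fault.trigger:
            self.deliver(payload)
            return
        self.log.append(f"message {index}: {self.fault.kind}")
        if self.fault.kind == "drop":
            return  # the message is silently lost
        if self.fault.kind == "corrupt":
            damaged = bytearray(payload)
            if damaged:
                damaged[0] ^= 1 << self.fault.bit  # flip a single bit
            self.deliver(bytes(damaged))
        elif self.fault.kind == "duplicate":
            self.deliver(payload)
            self.deliver(payload)  # spurious second copy
        else:
            raise ValueError(f"unknown fault kind: {self.fault.kind}")


def run_sum(fault: FaultSpec) -> Tuple[int, List[str]]:
    """Toy application: send the values 1..5 and sum what arrives."""
    received: List[int] = []
    channel = FaultyChannel(lambda p: received.append(p[0]), fault)
    for value in range(1, 6):
        channel.send(bytes([value]))
    return sum(received), channel.log


if __name__ == "__main__":
    for spec in (FaultSpec("none"), FaultSpec("drop", trigger=2),
                 FaultSpec("corrupt", trigger=0, bit=3)):
        print(spec.kind, run_sum(spec))
```

In this toy setting every faulty run still terminates normally and returns a plausible-looking integer, which mirrors the kind of undetected erroneous result the abstract reports. A receiver that checked message counts or checksums would be one way to assess a fault-tolerance technique against the same injected faults.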