Detecting Anomalies in High-Performance Parallel Programs

  • Authors:
  • German Florez; Zhen Liu; Susan Bridges; Rayford Vaughn; Anthony Skjellum

  • Venue:
  • ITCC '04: Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'04), Volume 2
  • Year:
  • 2004

Abstract

Message Passing Interface (MPI) is an effective programming technique for implementing parallel programs for distributed computation. As these applications run, a number of different types of irregularities can occur, including those that result from intrusions, user misbehavior, corrupted data, deadlocks, or failure of cluster components. In this paper, we perform a comparison of different artificial intelligence (AI) techniques that can be used to implement a lightweight monitoring and detection system for parallel applications on a cluster of Linux workstations. We study the accuracy and performance of deterministic and stochastic algorithms when we observe the flow of function library and OS system calls of parallel programs written with MPI. We demonstrate that monitoring of MPI programs can be achieved with high accuracy and in some cases with a 0% false positive rate in real time, and we show that the added computational load on each node is small. Finally, we demonstrate that simple deterministic methods perform poorly when the program flow grows in size and variety, and that more complex methods are required.
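
The abstract does not name the specific algorithms compared. As a rough illustration of what a simple deterministic detector over a call stream could look like, the sketch below assumes a sliding-window approach: it records every fixed-length window of calls seen in normal runs and reports the fraction of unseen windows in a new trace. The window length, the function names, and the sample traces are all hypothetical and are not taken from the paper.

```python
from typing import Iterable, List, Set, Tuple

Window = Tuple[str, ...]


def windows(trace: List[str], k: int) -> Iterable[Window]:
    """Yield every contiguous window of k calls from a trace."""
    for i in range(len(trace) - k + 1):
        yield tuple(trace[i:i + k])


def train(normal_traces: List[List[str]], k: int) -> Set[Window]:
    """Collect the set of windows observed in normal (attack-free) runs."""
    normal: Set[Window] = set()
    for trace in normal_traces:
        normal.update(windows(trace, k))
    return normal


def mismatch_rate(trace: List[str], normal: Set[Window], k: int) -> float:
    """Return the fraction of windows in the trace never seen in training."""
    seen = list(windows(trace, k))
    if not seen:
        return 0.0  # trace shorter than k: nothing to compare
    misses = sum(1 for w in seen if w not in normal)
    return misses / len(seen)


# Hypothetical traces mixing MPI library calls and OS system calls.
K = 3  # assumed window length, chosen only for illustration
normal_run = ["MPI_Init", "MPI_Send", "MPI_Recv",
              "MPI_Send", "MPI_Recv", "MPI_Finalize"]
suspect_run = ["MPI_Init", "MPI_Send", "execve", "MPI_Recv", "MPI_Finalize"]

profile = train([normal_run], K)
normal_score = mismatch_rate(normal_run, profile, K)
suspect_score = mismatch_rate(suspect_run, profile, K)
```

A detector of this kind stores one entry per distinct window, so its profile grows as the program's call flow becomes larger and more varied, which is consistent with the abstract's observation that simple deterministic methods degrade in that regime.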