Lightweight monitoring of MPI programs in real time

  • Authors:
  • German Florez;Zhen Liu;Susan M. Bridges;Anthony Skjellum;Rayford B. Vaughn

  • Affiliations:
  • Center for Computer Security Research and High Performance Computer Laboratory, Department of Computer Science and Engineering, Mississippi State University, MS 39762-9637, U.S.A.;Center for Computer Security Research and High Performance Computer Laboratory, Department of Computer Science and Engineering, Mississippi State University, MS 39762-9637, U.S.A.;Center for Computer Security Research and High Performance Computer Laboratory, Department of Computer Science and Engineering, Mississippi State University, MS 39762-9637, U.S.A.;Ctr. for Comp. Sec. Res. and High Perf. Comp. Lab., Dept. of Comp. Sci. and Eng., Mississippi State Univ. and Dept. of Comp. and Info. Sci., University of Alabama, Birmingham, AL, U.S.A.;Center for Computer Security Research and High Performance Computer Laboratory, Department of Computer Science and Engineering, Mississippi State University, MS 39762-9637, U.S.A.

  • Venue:
  • Concurrency and Computation: Practice & Experience
  • Year:
  • 2005

Abstract

Current technologies allow efficient data collection by several sensors to determine the overall status of a cluster. However, no previous work of which we are aware analyzes the behavior of the parallel programs themselves in real time. In this paper, we compare different artificial intelligence techniques that can be used to implement a lightweight monitoring and analysis system for parallel applications on a cluster of Linux workstations. We study the accuracy and performance of deterministic and stochastic algorithms when observing the flow of both library-function calls and operating-system calls made by parallel programs written in C with MPI. We demonstrate that MPI programs can be monitored in real time with high accuracy, in some cases with a false-positive rate near 0%, and we show that the added computational load on each node is small. For example, monitoring function calls with a hidden Markov model generates less than 5% overhead. The proposed system automatically detects deviations of a process from its expected behavior on any node of the cluster, and thus it can be used as an anomaly detector, as a performance monitor that complements other systems, or as a debugging tool. Copyright © 2005 John Wiley & Sons, Ltd.
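The abstract does not give the algorithms' details, so the following is only a minimal sketch of the two families it names, applied to a trace of call names. It assumes a fixed-length sliding-window lookup as the deterministic detector and a discrete hidden Markov model scored with the scaled forward algorithm as the stochastic one. The function names, the window length, and the hand-set model parameters are all hypothetical; in practice the parameters would be estimated from traces of normal runs rather than written by hand.

```python
import math


def train_ngrams(trace, n):
    """Collect every length-n window of calls seen in a normal trace."""
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}


def mismatch_rate(trace, known, n):
    """Fraction of length-n windows in `trace` never seen during training."""
    windows = [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]
    if not windows:
        return 0.0
    return sum(w not in known for w in windows) / len(windows)


def hmm_log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM.

    start[s]    : probability of starting in state s
    trans[r][s] : probability of moving from state r to state s
    emit[s]     : dict mapping a call name to its emission probability
    """
    states = range(len(start))
    alpha = [start[s] * emit[s].get(obs[0], 0.0) for s in states]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[r] * trans[r][s] for r in states)
                * emit[s].get(obs[t], 0.0)
                for s in states
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under the model
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_p


# Hypothetical two-state model: state 0 mostly sends, state 1 mostly receives.
START = [0.9, 0.1]
TRANS = [[0.1, 0.9], [0.9, 0.1]]
EMIT = [
    {"MPI_Send": 0.9, "MPI_Recv": 0.1},
    {"MPI_Send": 0.1, "MPI_Recv": 0.9},
]

if __name__ == "__main__":
    normal = ["MPI_Send", "MPI_Recv", "MPI_Send", "MPI_Recv"]
    odd = ["MPI_Send", "MPI_Send", "MPI_Send", "MPI_Recv"]
    known = train_ngrams(normal, 3)
    for name, trace in (("normal", normal), ("odd", odd)):
        print(name,
              mismatch_rate(trace, known, 3),
              hmm_log_likelihood(trace, START, TRANS, EMIT))
```

In a monitor built along these lines, a window would be flagged when its mismatch rate rises above a threshold or its log-likelihood falls below one. Both thresholds would have to be tuned on normal runs to keep the false-positive rate low.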