Lightweight monitoring of MPI programs in real time: Research Articles

Authors:
German Florez;Zhen Liu;Susan M. Bridges;Anthony Skjellum;Rayford B. Vaughn
Affiliations:
Center for Computer Security Research and High Performance Computer Laboratory, Department of Computer Science and Engineering, Mississippi State University, MS 39762-9637, U.S.A.;Center for Computer Security Research and High Performance Computer Laboratory, Department of Computer Science and Engineering, Mississippi State University, MS 39762-9637, U.S.A.;Center for Computer Security Research and High Performance Computer Laboratory, Department of Computer Science and Engineering, Mississippi State University, MS 39762-9637, U.S.A.;Ctr. for Comp. Sec. Res. and High Perf. Comp. Lab., Dept. of Comp. Sci. and Eng., Mississippi State Univ. and Dept. of Comp. and Info. Sci., University of Alabama, Birmingham, AL, U.S.A.;Center for Computer Security Research and High Performance Computer Laboratory, Department of Computer Science and Engineering, Mississippi State University, MS 39762-9637, U.S.A.
Venue:
Concurrency and Computation: Practice & Experience
Year:
2005

Citing 0
Cited 5

Incremental estimation of discrete hidden Markov models based on a new backward procedure
AAAI'05 Proceedings of the 20th national conference on Artificial intelligence - Volume 2

Monitoring MPI programs for performance characterization and management control
Proceedings of the 2010 ACM Symposium on Applied Computing

Proactive process-level live migration and back migration in HPC environments
Journal of Parallel and Distributed Computing

Efficient modeling of discrete events for anomaly detection using hidden markov models
ISC'05 Proceedings of the 8th international conference on Information Security

Multiclass classification of distributed memory parallel computations
Pattern Recognition Letters


Abstract

Current technologies allow efficient data collection by several sensors to determine an overall evaluation of the status of a cluster. However, no previous work of which we are aware analyzes the behavior of the parallel programs themselves in real time. In this paper, we compare different artificial intelligence techniques that can be used to implement a lightweight monitoring and analysis system for parallel applications on a cluster of Linux workstations. We study the accuracy and performance of deterministic and stochastic algorithms when observing the flow of both library-function and operating-system calls of parallel programs written in C with MPI. We demonstrate that monitoring of MPI programs can be achieved in real time with high accuracy, in some cases with a false-positive rate near 0%, and we show that the added computational load on each node is small. As an example, monitoring function calls using a hidden Markov model generates less than 5% overhead. The proposed system automatically detects deviations of a process from its expected behavior on any node of the cluster, and thus it can be used as an anomaly detector, as a performance monitor that complements other systems, or as a debugging tool. Copyright © 2005 John Wiley & Sons, Ltd.
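
The abstract does not give implementation details, so the following is only an illustrative sketch of how a hidden Markov model could flag a window of function calls that deviates from expected behavior. It scores the window with the standard scaled forward algorithm and compares the per-call log-likelihood against a threshold. The two-state model, its probabilities, the call alphabet, the threshold, and the helper names `sequence_log_likelihood` and `is_anomalous` are all assumptions made for illustration; they are not taken from the paper.

```python
import math

# Hypothetical alphabet: observed MPI library calls mapped to symbol ids.
CALLS = {"MPI_Send": 0, "MPI_Recv": 1, "MPI_Barrier": 2}

# Hypothetical two-state model of a program that alternates send and receive.
START = [0.5, 0.5]                       # initial state probabilities
TRANS = [[0.1, 0.9], [0.9, 0.1]]         # state transition probabilities
EMIT = [[0.90, 0.05, 0.05],              # state 0 mostly emits MPI_Send
        [0.05, 0.90, 0.05]]              # state 1 mostly emits MPI_Recv


def sequence_log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM."""
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate the previous (normalized) forward variables one step.
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n_states))
                * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")         # sequence impossible under the model
        log_prob += math.log(scale)      # accumulate the log of each scale
        alpha = [a / scale for a in alpha]
    return log_prob


def is_anomalous(window, threshold=-1.5):
    """Flag a window whose average per-call log-likelihood is below threshold."""
    obs = [CALLS[name] for name in window]
    score = sequence_log_likelihood(obs, START, TRANS, EMIT) / len(obs)
    return score < threshold


normal = ["MPI_Send", "MPI_Recv", "MPI_Send", "MPI_Recv"]
unusual = ["MPI_Barrier"] * 4
print(is_anomalous(normal), is_anomalous(unusual))
```

In a real monitor the model would be trained on traces of normal runs and the threshold tuned against the false-positive rate; both are fixed by hand here only to keep the sketch self-contained.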