Identifying HPC codes via performance logs and machine learning

Authors:
Orianna DeMasi;Taghrid Samak;David H. Bailey
Affiliations:
Lawrence Berkeley National Laboratory, Berkeley, USA;Lawrence Berkeley National Laboratory, Berkeley, USA;Lawrence Berkeley National Laboratory, Berkeley, USA
Venue:
Proceedings of the first workshop on Changing landscapes in HPC security
Year:
2013

Abstract

We leverage supervised learning to enable large-scale analysis of performance logs, with the goals of accurately classifying code runs and understanding the importance of different performance metrics. Previous work has demonstrated that high performance codes exhibit structured communication patterns. By categorizing these patterns, we can identify which code was executed. The ability to identify a code by its performance profile is useful for specializing HPC security systems and for identifying common optimizations for similar codes. We apply supervised machine learning to an extensive data set of real user runs from a high performance computing center. We employ and modify a rule ensemble method to predict which code was run given a performance log. The unmodified, naive method achieves greater than 93% accuracy. When modified to allow an "other class," accuracy increases to greater than 97%. This modification allows an anomalous run to be flagged as not belonging to a previously seen, or acceptable, code and provides additional latitude in monitoring what is run on supercomputing facilities. We conclude by interpreting the resulting rule model, which indicates which components of a code are most distinctive and useful for identification.
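As a rough illustration of the "other class" idea, the following Python sketch scores a run against one rule-based model per known code and reports "other" when no model is confident. It is not the authors' implementation: the thresholding scheme is only one plausible way to realize an "other class," and the metric name (`mpi_time_frac`), the code labels, the rule weights, and the threshold are all hypothetical values chosen for the example.

```python
from typing import Callable, Dict, List, Tuple

Metrics = Dict[str, float]
# A rule is a (weight, condition) pair; the condition tests a run's metrics.
Rule = Tuple[float, Callable[[Metrics], bool]]
# A per-code model is an intercept plus a list of weighted rules.
Model = Tuple[float, List[Rule]]


def rule_score(model: Model, run: Metrics) -> float:
    """Return the intercept plus the weights of all rules that fire."""
    intercept, rules = model
    return intercept + sum(weight for weight, cond in rules if cond(run))


def classify_with_other(models: Dict[str, Model], run: Metrics,
                        threshold: float = 0.0) -> str:
    """Pick the best-scoring code, or "other" if no score beats the threshold."""
    scores = {code: rule_score(model, run) for code, model in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else "other"


# Hypothetical models: labels, metric name, and weights are made up.
MODELS: Dict[str, Model] = {
    "code_a": (-1.0, [(2.0, lambda m: m["mpi_time_frac"] > 0.5)]),
    "code_b": (-1.0, [(2.0, lambda m: m["mpi_time_frac"] < 0.2)]),
}

if __name__ == "__main__":
    for frac in (0.7, 0.1, 0.3):
        print(frac, classify_with_other(MODELS, {"mpi_time_frac": frac}))
```

In this sketch a run that matches no known profile falls through to "other" and can be flagged for review, instead of being forced into the closest known code.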