Framework for enabling system understanding

Authors:
J. Brandt;F. Chen;A. Gentile;Chokchai (Box) Leangsuksun;J. Mayo;P. Pebay;D. Roe;N. Taerat;D. Thompson;M. Wong
Affiliations:
Sandia National Laboratories, Livermore, CA;Sandia National Laboratories, Livermore, CA;Sandia National Laboratories, Livermore, CA;Louisiana Tech University, Ruston, LA;Sandia National Laboratories, Livermore, CA;Sandia National Laboratories, Livermore, CA;Sandia National Laboratories, Livermore, CA;Louisiana Tech University, Ruston, LA;Sandia National Laboratories, Livermore, CA;Sandia National Laboratories, Livermore, CA
Venue:
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
Year:
2011

Abstract

Building the effective HPC resilience mechanisms required for the viability of next-generation supercomputers will require in-depth understanding of system and component behaviors. Our goal is to build an integrated framework for high-fidelity, long-term information storage, historical and run-time analysis, and algorithmic and visual information exploration, in order to enable system understanding, timely failure detection and prediction, and the triggering of appropriate responses to failure situations. Since it is not known in advance which information is relevant, and since potentially relevant data may be expressed in a variety of forms (e.g., numeric, textual), this framework must provide capabilities to process different forms of data and must also support the integration of new data, data sources, and analysis capabilities. Further, to ensure ease of use as capabilities and data sources expand, it must also provide interactivity between its elements. This paper describes our integration of the capabilities mentioned above into our OVIS tool.
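
As a rough illustration of the extensibility requirement stated in the abstract (one store holding both numeric and textual data, with analyses that can be added later), a minimal Python sketch might look like the following. It is not the OVIS API: the names `Record`, `Framework`, `numeric_outliers`, and `text_matches` are hypothetical, and the two-standard-deviation threshold is an assumption chosen only for the example.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, Dict, Iterable, List, Union

# A value is either a numeric metric sample or a textual log line.
Value = Union[float, str]


@dataclass(frozen=True)
class Record:
    source: str       # hypothetical data-source name, e.g. "temperature"
    component: str    # hypothetical component identifier, e.g. "node5"
    timestamp: float  # seconds since some epoch
    value: Value


# An analysis takes the stored records and returns the ones it flags.
Analysis = Callable[[List[Record]], List[Record]]


class Framework:
    """Hypothetical registry of data and analyses (illustration only)."""

    def __init__(self) -> None:
        self.store: List[Record] = []
        self.analyses: Dict[str, Analysis] = {}

    def ingest(self, records: Iterable[Record]) -> None:
        # New data sources only need to emit Record objects.
        self.store.extend(records)

    def register(self, name: str, analysis: Analysis) -> None:
        # New analysis capabilities are added without changing the store.
        self.analyses[name] = analysis

    def run(self, name: str) -> List[Record]:
        return self.analyses[name](self.store)


def numeric_outliers(records: List[Record], threshold: float = 2.0) -> List[Record]:
    """Flag numeric records more than `threshold` population standard
    deviations from the mean (an assumed, deliberately simple criterion)."""
    numeric = [r for r in records if isinstance(r.value, (int, float))]
    if len(numeric) < 2:
        return []
    values = [r.value for r in numeric]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [r for r in numeric if abs(r.value - mu) > threshold * sigma]


def text_matches(records: List[Record], keyword: str = "error") -> List[Record]:
    """Flag textual records containing `keyword` (case-insensitive)."""
    return [
        r for r in records
        if isinstance(r.value, str) and keyword in r.value.lower()
    ]
```

In this sketch, numeric samples and log lines share one record type, so a numeric analysis and a textual analysis can both be registered against the same stored history and run independently.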