What Supercomputers Say: A Study of Five System Logs

Authors:
Adam Oliner; Jon Stearley
Affiliations:
Stanford University, USA; Sandia National Laboratories, USA
Venue:
DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Year:
2007

Citing 0
Cited 37

An automated approach for abstracting execution logs to execution events

Journal of Software Maintenance and Evolution: Research and Practice - Special Issue on Program Comprehension through Dynamic Analysis (PCODA)
An analysis of clustered failures on large supercomputing systems

Journal of Parallel and Distributed Computing
Fault injection framework for system resilience evaluation: fake faults for finding future failures

Proceedings of the 2009 workshop on Resiliency in high performance
Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities

International Journal of High Performance Computing Applications
Detecting large-scale system problems by mining console logs

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Toward Exascale Resilience

International Journal of High Performance Computing Applications
Diagnosis of recurrent faults using log files

CASCON '09 Proceedings of the 2009 Conference of the Center for Advanced Studies on Collaborative Research
A study of dynamic meta-learning for failure prediction in large-scale systems

Journal of Parallel and Distributed Computing
Efficiently extracting operational profiles from execution logs using suffix arrays

ISSRE'09 Proceedings of the 20th IEEE international conference on software reliability engineering
A query language for understanding component interactions in production systems

Proceedings of the 24th ACM International Conference on Supercomputing
Analysis of execution log files

Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 2
End-to-end framework for fault management for open source clusters: Ranger

Proceedings of the 2010 TeraGrid Conference
Mining invariants from console logs for system problem detection

USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
Error log processing for accurate failure prediction

WASL'08 Proceedings of the First USENIX conference on Analysis of system logs
SALSA: analyzing logs as state machines

WASL'08 Proceedings of the First USENIX conference on Analysis of system logs
Mining console logs for large-scale system problem detection

SysML'08 Proceedings of the Third conference on Tackling computer systems problems with machine learning techniques
Online event correlations analysis in system logs of large-scale cluster systems

NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
Symptom-based problem determination using log data abstraction

Proceedings of the 2010 Conference of the Center for Advanced Studies on Collaborative Research
A graphical representation for identifier structure in logs

SLAML'10 Proceedings of the 2010 workshop on Managing systems via log analysis and machine learning techniques
Experience mining Google's production console logs

SLAML'10 Proceedings of the 2010 workshop on Managing systems via log analysis and machine learning techniques
Bridging the gaps: joining information sources with Splunk

SLAML'10 Proceedings of the 2010 workshop on Managing systems via log analysis and machine learning techniques
Storage and retrieval of system log events using a structured schema based on message type transformation

Proceedings of the 2011 ACM Symposium on Applied Computing
Using automatic persistent memoization to facilitate data analysis scripting

Proceedings of the 2011 International Symposium on Software Testing and Analysis
Sloppy Python: using dynamic analysis to automatically add error tolerance to ad-hoc data processing scripts

Proceedings of the Ninth International Workshop on Dynamic Analysis
Adaptive event prediction strategy with dynamic time window for large-scale HPC systems

SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques
Smarter log analysis

IBM Journal of Research and Development
LogSig: generating system events from raw textual logs

Proceedings of the 20th ACM international conference on Information and knowledge management
Failure data-driven selective node-level duplication to improve MTTF in high performance computing systems

HPCS'09 Proceedings of the 23rd international conference on High Performance Computing Systems and Applications
Provenance for system troubleshooting

LISA'11 Proceedings of the 25th international conference on Large Installation System Administration
Spatio-temporal decomposition, clustering and identification for alert detection in system logs

Proceedings of the 27th Annual ACM Symposium on Applied Computing
3-Dimensional root cause diagnosis via co-analysis

Proceedings of the 9th international conference on Autonomic computing
Fault prediction under the microscope: a closer look into HPC systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Heterogeneity and dynamicity of clouds at scale: Google trace analysis

Proceedings of the Third ACM Symposium on Cloud Computing
On the Path to Exascale

International Journal of Distributed Systems and Technologies
Provenance from log files: a BigData problem

Proceedings of the Joint EDBT/ICDT 2013 Workshops
A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

The Journal of Supercomputing
Structured and Interoperable Logging for the Cloud Computing Era: The Pitfalls and Benefits

UCC '13 Proceedings of the 2013 IEEE/ACM 6th International Conference on Utility and Cloud Computing

Abstract

If we hope to automatically detect and diagnose failures in large-scale computer systems, we must study real deployed systems and the data they generate. Progress has been hampered by the inaccessibility of empirical data. This paper addresses that dearth by examining system logs from five supercomputers, with the aim of providing useful insight and direction for future research into the use of such logs. We present details about the systems, methods of log collection, and how alerts were identified; propose a simpler and more effective filtering algorithm; and define operational context to encompass the crucial information that we found to be currently missing from most logs. The machines we consider (and the number of processors) are: Blue Gene/L (131072), Red Storm (10880), Thunderbird (9024), Spirit (1028), and Liberty (512). This is the first study of raw system logs from multiple supercomputers.