If we hope to automatically detect and diagnose failures in large-scale computer systems, we must study real deployed systems and the data they generate. Progress has been hampered by the inaccessibility of empirical data. This paper addresses that dearth by examining system logs from five supercomputers, with the aim of providing useful insight and direction for future research into the use of such logs. We present details about the systems, methods of log collection, and how alerts were identified; propose a simpler and more effective filtering algorithm; and define operational context to encompass the crucial information that we found to be currently missing from most logs. The machines we consider (and the number of processors) are: Blue Gene/L (131072), Red Storm (10880), Thunderbird (9024), Spirit (1028), and Liberty (512). This is the first study of raw system logs from multiple supercomputers.