Improving Log-based Field Failure Data Analysis of multi-node computing systems

Authors:
Antonio Pecchia;Domenico Cotroneo;Zbigniew Kalbarczyk;Ravishankar K. Iyer
Affiliations:
Dipartimento di Informatica e Sistemistica, Universitè degli Studi di Napoli Federico II, Via Claudio 21, 80125, Naples, Italy;Dipartimento di Informatica e Sistemistica, Universitè degli Studi di Napoli Federico II, Via Claudio 21, 80125, Naples, Italy;Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign, 1308 W. Main Street, 61801, USA;Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign, 1308 W. Main Street, 61801, USA
Venue:
DSN '11 Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable Systems&Networks
Year:
2011

Citing 0
Cited 2

3-Dimensional root cause diagnosis via co-analysis

Proceedings of the 9th international conference on Autonomic computing
Fault prediction under the microscope: a closer look into HPC systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Quantified Score

Hi-index	0.00

Visualization

Abstract

Log-based Field Failure Data Analysis (FFDA) is a widely-adopted methodology to assess dependability properties of an operational system. A key step in FFDA is filtering out entries that are not useful and redundant error entries from the log. The latter is challenging: a fault, once triggered, can generate multiple errors that propagate within the system. Grouping the error entries related to the same fault manifestation is crucial to obtain realistic measurements. This paper deals with the issues of the tuple heuristic, used to group the error entries in the log, in multi-node computing systems. We demonstrate that the tuple heuristic can group entries incorrectly; thus, an improved heuristic that adopts statistical indicators is proposed. We assess the impact of inaccurate grouping on dependability measurements by comparing the results obtained with both the heuristics. The analysis encompasses the log of the Mercury cluster at the National Center for Supercomputing Applications.