Identifying faults in large-scale distributed systems by filtering noisy error logs

Authors:
Xiang Rao;Huaimin Wang;Dianxi Shi;Zhenbang Chen;Hua Cai;Qi Zhou;Tingtao Sun
Affiliations:
National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha, P.R. China 410073;National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha, P.R. China 410073;National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha, P.R. China 410073;National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha, P.R. China 410073;Computing Platform, Alibaba Cloud Computing Corporation, Hangzhou, P.R. China;Computing Platform, Alibaba Cloud Computing Corporation, Hangzhou, P.R. China;Computing Platform, Alibaba Cloud Computing Corporation, Hangzhou, P.R. China
Venue:
DSNW '11 Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops
Year:
2011

Citing: 0
Cited by: 1
3-Dimensional root cause diagnosis via co-analysis. Proceedings of the 9th International Conference on Autonomic Computing.

Abstract

Extracting fault features from the error logs of fault injection tests has been widely studied in the area of large-scale distributed systems for decades. However, the extraction process is severely affected by a large amount of noisy logs. Existing work tries to solve the problem by compressing logs in the temporal and spatial views or by removing semantic redundancy between logs, but it fails to consider the co-existence of noisy faults, other than the injected ones, that also generate error logs, for example random hardware faults, unexpected software bugs, system configuration faults, or errors in ranking log severity. During fault feature extraction, those noisy faults generate error logs that are unrelated to the target fault and strongly mislead the resulting fault features. We call an error log that is not related to a target fault a noisy error log. To filter out noisy error logs, we present a similarity-based error log filtering method, SBF, which consists of three integrated steps: (1) model error logs as time series and use the Haar wavelet transform to obtain the approximate time series; (2) divide the approximate time series into sub time series at valleys; (3) identify noisy error logs by comparing the similarity between the sub time series of target error logs and the template of noisy error logs. We apply our log filtering method to an enterprise cloud system and show its effectiveness. Compared with existing work, we successfully filter out noisy error logs and increase the precision and recall of fault feature extraction.
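
The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the decomposition depth, the valley criterion, the similarity measure, or the threshold. The choices below are therefore assumptions for illustration only: a strict local-minimum test for valleys, linear resampling, cosine similarity, and the hypothetical names `flag_noisy_segments`, `levels`, and `threshold`.

```python
from math import sqrt


def haar_approximation(series, levels=1):
    """Step 1: keep only the Haar approximation coefficients.
    Each pass replaces adjacent pairs (a, b) with (a + b) / sqrt(2)."""
    approx = list(series)
    for _ in range(levels):
        if len(approx) < 2:
            break
        if len(approx) % 2:  # assumption: pad odd lengths by repeating the last value
            approx.append(approx[-1])
        approx = [(approx[i] + approx[i + 1]) / sqrt(2)
                  for i in range(0, len(approx), 2)]
    return approx


def split_by_valleys(series):
    """Step 2: cut the series at local minima (assumed valley criterion).
    The valley point closes one segment and opens the next."""
    segments, start = [], 0
    for i in range(1, len(series) - 1):
        if series[i] < series[i - 1] and series[i] <= series[i + 1]:
            segments.append(series[start:i + 1])
            start = i
    segments.append(series[start:])
    return segments


def resample(series, n):
    """Linearly interpolate `series` onto n evenly spaced points."""
    if len(series) == 1 or n == 1:
        return [series[0]] * n
    out = []
    for k in range(n):
        pos = k * (len(series) - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, len(series) - 1)
        frac = pos - lo
        out.append(series[lo] * (1 - frac) + series[hi] * frac)
    return out


def cosine_similarity(a, b):
    """Assumed similarity measure; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def flag_noisy_segments(counts, noise_template, levels=1, threshold=0.95):
    """Step 3: mark a segment as noisy when it resembles the noise template."""
    segments = split_by_valleys(haar_approximation(counts, levels))
    flags = [
        cosine_similarity(resample(seg, len(noise_template)), noise_template) >= threshold
        for seg in segments
    ]
    return segments, flags
```

Here `counts` stands for the number of error logs per time window and `noise_template` for a series recorded when only a known noisy fault was active. Both inputs are assumptions about how the time series would be built, which the abstract leaves open.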