Healing online service systems via mining historical issue repositories

Authors:
Rui Ding;Qiang Fu;Jian-Guang Lou;Qingwei Lin;Dongmei Zhang;Jiajun Shen;Tao Xie
Affiliations:
Microsoft Research, China;Microsoft Research, China;Microsoft Research, China;Microsoft Research, China;Microsoft Research, USA;Shanghai Jiao Tong University, China;North Carolina State University, USA
Venue:
Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering
Year:
2012

Abstract

Online service systems have become increasingly popular and important, and demand for the availability of the services they provide keeps growing. Despite significant efforts to keep services up continuously, reducing the MTTR (Mean Time to Restore) of a service remains the most important step to assure the user-perceived availability of the service. To reduce the MTTR, a common practice is to restore the service by identifying and applying an appropriate healing action (i.e., a temporary workaround action such as rebooting a SQL machine). However, manually identifying an appropriate healing action for a given new issue (such as service down) is typically time-consuming and error-prone. To address this challenge, we present an automated mining-based approach for suggesting an appropriate healing action for a given new issue. Our approach generates signatures of an issue from its corresponding transaction logs and then uses these signatures to retrieve similar historical issues from a historical issue repository. Finally, our approach suggests an appropriate healing action by adapting the healing actions recorded for the retrieved historical issues. We have implemented a healing suggestion system for our approach and applied it to a real-world product online service that serves millions of online customers globally. Studies on 77 incidents (severe issues) over 3 months showed that our approach can effectively provide appropriate healing actions to reduce the MTTR of the service.
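The three steps named in the abstract (signature generation, retrieval of similar historical issues, healing suggestion) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the paper's method: the signature is taken to be the set of event types in the logs (the first token of each line), similarity is plain Jaccard overlap, and the "adaptation" step is reduced to reusing the action of the closest historical issue. The names `HistoricalIssue`, `make_signature`, `jaccard`, `suggest_healing_action`, and the threshold `min_similarity` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HistoricalIssue:
    issue_id: str
    signature: frozenset   # assumed form: set of log event types seen during the issue
    healing_action: str    # the workaround that restored the service


def make_signature(log_lines):
    """Assumed signature: the set of event types, read as the first token of each line."""
    return frozenset(line.split()[0] for line in log_lines if line.strip())


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def suggest_healing_action(new_logs, repository, min_similarity=0.5):
    """Return the healing action of the most similar historical issue, or None."""
    signature = make_signature(new_logs)
    best = max(repository,
               key=lambda issue: jaccard(signature, issue.signature),
               default=None)
    if best is None or jaccard(signature, best.signature) < min_similarity:
        return None  # nothing similar enough: leave the decision to an operator
    return best.healing_action


# Illustrative data only; these issues and log lines are made up.
repository = [
    HistoricalIssue("I-1", frozenset({"SQL_TIMEOUT", "CONN_RESET"}), "reboot SQL machine"),
    HistoricalIssue("I-2", frozenset({"AUTH_FAIL", "TOKEN_EXPIRED"}), "recycle auth service"),
]
new_logs = ["SQL_TIMEOUT query=q1", "CONN_RESET host=db1", "SQL_TIMEOUT query=q2"]
print(suggest_healing_action(new_logs, repository))
```

Returning `None` below the threshold is a deliberate choice in this sketch: when no historical issue resembles the new one, suggesting nothing is safer than suggesting an unrelated workaround.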