Problem determination remains one of the most expensive and time-consuming functions in system management because it is hard to automate what is essentially a highly experience-dependent task. In this paper we study the characteristics of problem tickets in an enterprise IT infrastructure and observe that most tickets come from very few products and modules, and that OS problems take longer to resolve. We propose PDA, a problem management tool that provides automated problem diagnosis capabilities to help system administrators solve real-world problems more efficiently. PDA uses a two-level approach: proactive, high-level system health checks, coupled with rule-based "drill-down" probing that automatically collects detailed information related to the problem. The tool allows system administrators to author and customize probes and rules and to share them across the organization. We illustrate the usage and benefits of PDA with a number of UNIX problem scenarios, which show that PDA can quickly collect key information through its rules to aid problem determination.
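The two-level approach described above can be illustrated with a minimal sketch. This is not PDA's implementation, which the abstract does not detail. The names `Rule`, `diagnose`, `check_disk`, `probe_largest_dir`, and `probe_log_rotation`, as well as all thresholds and readings, are hypothetical and chosen only for illustration. The probes return canned values instead of querying a real UNIX host, so the example runs anywhere.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Facts = Dict[str, float]
Probe = Callable[[], Facts]


@dataclass
class Rule:
    """If `condition` holds on the facts gathered so far, run `probe`."""
    name: str
    condition: Callable[[Facts], bool]
    probe: Probe


def diagnose(health_checks: List[Probe], rules: List[Rule]) -> Facts:
    """Level 1: run every health check. Level 2: fire matching rules."""
    facts: Facts = {}
    for check in health_checks:
        facts.update(check())

    fired = set()
    progress = True
    while progress:  # repeat until no new rule fires
        progress = False
        for rule in rules:
            if rule.name not in fired and rule.condition(facts):
                facts.update(rule.probe())  # drill-down probe adds detail
                fired.add(rule.name)        # each rule fires at most once
                progress = True
    return facts


# Hypothetical probes with canned readings (assumptions, not real data).
def check_disk() -> Facts:
    return {"disk_used_pct": 97.0}


def probe_largest_dir() -> Facts:
    return {"var_log_gb": 41.5}


def probe_log_rotation() -> Facts:
    return {"logrotate_enabled": 0.0}


RULES = [
    Rule("disk_full", lambda f: f.get("disk_used_pct", 0) > 90, probe_largest_dir),
    Rule("big_logs", lambda f: f.get("var_log_gb", 0) > 20, probe_log_rotation),
]

report = diagnose([check_disk], RULES)
healthy = diagnose([lambda: {"disk_used_pct": 10.0}], RULES)
```

In this sketch a failed health check triggers the first rule, whose probe result in turn triggers the second rule, so detail is collected only along the path the symptoms point to. A healthy reading fires no rules and no drill-down probes run.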