MyExperience: a system for in situ tracing and capturing of user feedback on mobile phones
Proceedings of the 5th international conference on Mobile systems, applications and services
Triage: diagnosing production run failures at the user's site
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
PDA: a tool for automated problem determination
LISA'07 Proceedings of the 21st conference on Large Installation System Administration Conference
Snitch: interactive decision trees for troubleshooting misconfigurations
SYSML'07 Proceedings of the 2nd USENIX workshop on Tackling computer systems problems with machine learning techniques
D3S: debugging deployed distributed systems
NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
Co-designing the failure analysis and monitoring of large-scale systems
ACM SIGMETRICS Performance Evaluation Review
Self-correlating predictive information tracking for large-scale production systems
ICAC '09 Proceedings of the 6th international conference on Autonomic computing
Detailed diagnosis in enterprise networks
Proceedings of the ACM SIGCOMM 2009 conference on Data communication
Lightweight, high-resolution monitoring for troubleshooting production systems
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
OLIC: online information compression for scalable hosting infrastructure monitoring
Proceedings of the Nineteenth International Workshop on Quality of Service
Context-based online configuration-error detection
USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
BLR-D: applying bilinear logistic regression to factored diagnosis problems
SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques
Fay: extensible distributed tracing from kernels to clusters
SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
BLR-D: applying bilinear logistic regression to factored diagnosis problems
ACM SIGOPS Operating Systems Review
Fay: Extensible Distributed Tracing from Kernels to Clusters
ACM Transactions on Computer Systems (TOCS)
System-Level support for intrusion recovery
DIMVA'12 Proceedings of the 9th international conference on Detection of Intrusions and Malware, and Vulnerability Assessment
Hi-index | 0.00 |
Mismanagement of the persistent state of a system-all the executable files, configuration settings and other data that govern how a system functions-causes reliability problems, security vulnerabilities, and drives up operation costs. Recent research traces persistent state interactions-how state is read, modified, etc.-to help troubleshooting, change management and malware mitigation, but has been limited by the difficulty of collecting, storing, and analyzing the 10s to 100s of millions of daily events that occur on a single machine, much less the 1000s or more machines in many computing environments. We present the Flight Data Recorder (FDR) that enables always-on tracing, storage and analysis of persistent state interactions. FDR uses a domain-specific log format, tailored to observed file system workloads and common systems management queries. Our lossless log format compresses logs to only 0.5-0.9 bytes per interaction. In this log format, 1000 machine-days of logs-over 25 billion events-can be analyzed in less than 30 minutes. We report on our deployment of FDR to 207 production machines at MSN, and show that a single centralized collection machine can potentially scale to collecting and analyzing the complete records of persistent state interactions from 4000+ machines. Furthermore, our tracing technology is shipping as part of the Windows Vista OS.