Debugging in the (very) large: ten years of implementation and experience

Authors:
Kirk Glerum;Kinshuman Kinshumann;Steve Greenberg;Gabriel Aul;Vince Orgovan;Greg Nichols;David Grant;Gretchen Loihle;Galen Hunt
Affiliations:
Microsoft Corporation, Redmond, WA, USA;Microsoft Corporation, Redmond, WA, USA;Microsoft Corporation, Redmond, WA, USA;Microsoft Corporation, Redmond, WA, USA;Microsoft Corporation, Redmond, WA, USA;Microsoft Corporation, Redmond, WA, USA;Microsoft Corporation, Redmond, WA, USA;Microsoft Corporation, Redmond, WA, USA;Microsoft Corporation, Redmond, WA, USA
Venue:
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Year:
2009

Abstract

Windows Error Reporting (WER) is a distributed system that automates the processing of error reports coming from an installed base of a billion machines. WER has collected billions of error reports in ten years of operation. It collects error data automatically and classifies errors into buckets, which are used to prioritize developer effort and report fixes to users. WER uses a progressive approach to data collection, which minimizes overhead for most reports yet allows developers to collect detailed information when needed. WER takes advantage of its scale to use error statistics as a tool in debugging; this allows developers to isolate bugs that could not be found at smaller scale. WER has been designed for large scale: one pair of database servers can record all the errors that occur on all Windows computers worldwide.
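
The bucketing idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how WER assigns reports to buckets, so everything here is an assumption made for illustration: the `ErrorReport` fields, the `bucket_label` key, and the sample data are hypothetical. The sketch only shows the general pattern of grouping reports under a shared key and ranking the groups by volume so that the most frequent failures surface first.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class ErrorReport(NamedTuple):
    # Hypothetical fields, chosen for illustration only;
    # the abstract does not list what a report contains.
    app_name: str
    app_version: str
    module_name: str
    offset: int


def bucket_label(report: ErrorReport) -> tuple:
    """Map a report to a bucket key; reports with equal keys share a bucket."""
    return (report.app_name, report.app_version, report.module_name, report.offset)


def rank_buckets(reports: Iterable[ErrorReport]) -> list:
    """Count reports per bucket and return (bucket, count) pairs, largest first."""
    counts = Counter(bucket_label(r) for r in reports)
    return counts.most_common()


# Made-up sample data: two reports share a bucket, one stands alone.
reports = [
    ErrorReport("editor.exe", "1.0", "render.dll", 0x1A2B),
    ErrorReport("editor.exe", "1.0", "render.dll", 0x1A2B),
    ErrorReport("editor.exe", "1.0", "net.dll", 0x0042),
]
ranked = rank_buckets(reports)
print(ranked[0])
```

Ranking by count is one simple way to turn bucket statistics into a priority order for developers; a real system would likely weigh other signals as well.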