Understanding the performance of distributed systems requires correlating thousands of interactions between numerous components, a task best left to a computer. Today's systems provide voluminous traces from each component but do not synthesise the data into concise models of system performance. We argue that online performance modelling should be a ubiquitous operating system service and outline several uses, including performance debugging, capacity planning, system tuning and anomaly detection. We describe the Magpie modelling service, which collates detailed traces from multiple machines in an e-commerce site, extracts request-specific audit trails, and constructs probabilistic models of request behaviour. A feasibility study evaluates the approach using an offline demonstrator. Results show that the approach is promising, but that many challenges remain in building a truly ubiquitous, online modelling infrastructure.
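
The pipeline named in the abstract (collate traces, extract per-request audit trails, build a probabilistic model) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the event tuple layout, the `request_id` field, the sample data, and the mean/standard-deviation model are hypothetical and are not taken from Magpie, whose actual extraction and modelling techniques the abstract does not specify.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical trace events collected from several machines:
# (machine, request_id, component, start_ms, end_ms).
# The request_id tag is an assumption of this sketch, not a Magpie feature.
EVENTS = [
    ("web1", "r1", "http", 0, 40),
    ("db1", "r1", "sql", 10, 30),
    ("web1", "r2", "http", 5, 65),
    ("db1", "r2", "sql", 15, 55),
    ("web1", "r3", "http", 20, 58),
    ("db1", "r3", "sql", 25, 45),
]


def audit_trails(events):
    """Collate events from all machines into one time-ordered trail per request."""
    trails = defaultdict(list)
    for machine, request_id, component, start, end in events:
        trails[request_id].append((start, machine, component, end - start))
    return {rid: sorted(trail) for rid, trail in trails.items()}


def component_model(trails):
    """Summarise each component's duration as (mean, standard deviation)."""
    samples = defaultdict(list)
    for trail in trails.values():
        for _start, _machine, component, duration in trail:
            samples[component].append(duration)
    return {c: (mean(d), pstdev(d)) for c, d in samples.items()}


def is_anomalous(model, component, duration, k=3.0):
    """Flag a duration more than k standard deviations from the component mean."""
    mu, sigma = model[component]
    return abs(duration - mu) > k * sigma


trails = audit_trails(EVENTS)
model = component_model(trails)
print(trails["r1"])
print(is_anomalous(model, "http", 200))
```

A per-component mean and deviation is about the simplest probabilistic summary possible; it is used here only to show how an audit trail could feed anomaly detection, one of the uses the abstract lists.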