Today’s system monitoring tools can detect system failures such as host failures, OS errors, and network partitions in near-real time. Unfortunately, the same cannot yet be said of the end-to-end distributed software stack. Any given action, such as reliably transferring a directory of files, can involve a wide range of complex and interrelated actions across multiple pieces of software: checking user certificates and permissions, getting details for all files, performing third-party transfers, understanding retry policy decisions, and so on. We present an infrastructure for troubleshooting complex middleware, a general-purpose technique for configurable log summarization, and an anomaly detection technique that works in near-real time on running Grid middleware. We present results gathered using this infrastructure from instrumented Grid middleware and applications running on the Emulab testbed. From these results, we analyze how accurately several algorithms detect a variety of performance anomalies.