Performance debugging for distributed systems of black boxes
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
The Tau Parallel Performance System
International Journal of High Performance Computing Applications
Problem diagnosis in large-scale computing environments
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
PVFS: a parallel file system for linux clusters
ALS'00 Proceedings of the 4th annual Linux Showcase & Conference - Volume 4
Black-box problem diagnosis in parallel file systems
FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Detecting application-level failures in component-based Internet services
IEEE Transactions on Neural Networks
Automated diagnosis without predictability is a recipe for failure
HotCloud'12 Proceedings of the 4th USENIX conference on Hot Topics in Cloud Ccomputing
Hi-index | 0.00 |
We present a behavior-based problem-diagnosis approach for PVFS that analyzes a novel source of instrumentation--CPU instruction-pointer samples and function-call traces--to localize the faulty server and to enable root-cause analysis of the resource at fault. We validate our approach by injecting realistic storage and network problems into three different workloads (dd, IO-zone, and PostMark) on a PVFS cluster.