REMO: Resource-Aware Application State Monitoring for Large-Scale Distributed Systems

Authors:
Shicong Meng;Srinivas R. Kashyap;Chitra Venkatramani;Ling Liu
Venue:
ICDCS '09 Proceedings of the 2009 29th IEEE International Conference on Distributed Computing Systems
Year:
2009

Citing 0
Cited 7

Visualizing large-scale streaming applications
Information Visualization

Distributed middleware reliability and fault tolerance support in system S
Proceedings of the 5th ACM international conference on Distributed event-based system

A model-based framework for building extensible, high performance stream processing middleware and programming language for IBM InfoSphere Streams
Software—Practice & Experience

A decentralized approach for mining event correlations in distributed system monitoring
Journal of Parallel and Distributed Computing

Aggregation for implicit invocations
Proceedings of the 12th annual international conference on Aspect-oriented software development

Monitoring-as-a-service in the cloud: spec phd award (invited abstract)
Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering

Performance troubleshooting in data centers: an annotated bibliography?
ACM SIGOPS Operating Systems Review

Abstract

To observe, analyze, and control large-scale distributed systems and the applications hosted on them, there is an increasing need to continuously monitor performance attributes of both the distributed system and application state. This gives rise to application state monitoring tasks, which require fine-grained attribute information to be collected efficiently from the relevant nodes. Existing approaches either treat multiple application state monitoring tasks independently and build an ad hoc monitoring tree for each task, or construct a single static monitoring tree for multiple tasks. We argue that carefully planning multiple application state monitoring tasks, by jointly considering multi-task optimization and node-level resource constraints, can provide significant gains in performance and scalability. In this paper, we present REMO, a REsource-aware application state MOnitoring system. REMO produces a forest of optimized monitoring trees by iterating over two phases: one explores cost-sharing opportunities via estimation, and the other refines the monitoring plan through resource-sensitive tree construction. Our experimental results include those gathered by deploying REMO on a BlueGene/P rack running IBM's large-scale distributed streaming system, System S. Using REMO to run over 200 monitoring tasks for an application deployed across 200 nodes results in a 35%-45% decrease in the percentage error of collected attributes compared to existing schemes.
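
The idea of planning several monitoring tasks jointly under resource limits can be illustrated with a small sketch. It is not REMO's actual algorithm, which the abstract does not specify. It assumes a hypothetical cost model (a fixed per-message overhead plus a per-value cost), flat trees in which each node sends one message per tree, and a hypothetical per-tree budget `max_tree_load` standing in for node-level resource constraints. All names (`tree_load`, `plan_forest`, the constants) are illustrative.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

# Assumed cost model (not from the abstract): every message carries a fixed
# overhead plus a small cost per attribute value it contains.
PER_MESSAGE_OVERHEAD = 10.0
PER_VALUE_COST = 1.0


def tree_load(group: FrozenSet[str], tasks: Dict[str, Set[str]]) -> float:
    """Estimated cost of one collection round for a tree serving `group`.

    Each node sends a single message holding all of its attributes in the group.
    """
    values_per_node: Dict[str, int] = {}
    for attr in group:
        for node in tasks[attr]:
            values_per_node[node] = values_per_node.get(node, 0) + 1
    return sum(PER_MESSAGE_OVERHEAD + PER_VALUE_COST * n
               for n in values_per_node.values())


def plan_forest(tasks: Dict[str, Set[str]],
                max_tree_load: float) -> List[FrozenSet[str]]:
    """Greedily merge attribute groups while the merged tree stays in budget.

    Starts with one tree per attribute, then repeatedly applies the merge with
    the largest estimated saving until no feasible merge saves anything.
    """
    groups: List[FrozenSet[str]] = [frozenset([a]) for a in sorted(tasks)]
    while True:
        best: Optional[Tuple[float, int, int, FrozenSet[str]]] = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                merged = groups[i] | groups[j]
                load = tree_load(merged, tasks)
                if load > max_tree_load:
                    continue  # merged tree would exceed the assumed budget
                gain = (tree_load(groups[i], tasks)
                        + tree_load(groups[j], tasks) - load)
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, i, j, merged)
        if best is None:
            return groups
        _, i, j, merged = best
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.append(merged)


# Hypothetical workload: each attribute maps to the nodes that report it.
tasks = {
    "cpu": {"n1", "n2", "n3"},
    "mem": {"n1", "n2", "n3"},
    "disk": {"n4"},
}
separate_cost = sum(tree_load(frozenset([a]), tasks) for a in tasks)
forest = plan_forest(tasks, max_tree_load=40.0)
shared_cost = sum(tree_load(g, tasks) for g in forest)
print(separate_cost, shared_cost, forest)
```

In this toy workload, `cpu` and `mem` are reported by the same nodes, so merging them into one tree removes duplicate per-message overhead, while `disk` stays in its own tree because adding it would exceed the assumed budget. Lowering `max_tree_load` below the cost of the merged tree leaves all three attributes in separate trees.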