GEMS: Gossip-Enabled Monitoring Service for Scalable Heterogeneous Distributed Systems

Authors:
Rajagopal Subramaniyan; Pirabhu Raman; Alan D. George; Matthew Radlinski
Affiliations:
High-performance Computing and Simulation (HCS) Research Laboratory, Department of Electrical and Computer Engineering, University of Florida, Gainesville 32611-6200 (all authors)
Venue:
Cluster Computing
Year:
2006

Abstract

Gossip protocols have proven to be effective means by which failures can be detected in large, distributed systems in an asynchronous manner without the limitations associated with reliable multicasting for group communications. In this paper, we discuss the development and features of a Gossip-Enabled Monitoring Service (GEMS), a highly responsive and scalable resource monitoring service, to monitor health and performance information in heterogeneous distributed systems. GEMS has many novel and essential features such as detection of network partitions and dynamic insertion of new nodes into the service. Easily extensible, GEMS also incorporates facilities for distributing arbitrary system and application-specific data. We present experiments and analytical projections demonstrating scalability, fast response times and low resource utilization requirements, making GEMS a potent solution for resource monitoring in distributed computing.
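
To make the idea of gossip-based failure detection concrete, here is a minimal sketch of the general heartbeat-gossip technique. It is not the GEMS implementation: the class name `GossipNode`, the `t_fail` timeout, the logical `clock`, and the `gossip_round` helper are all hypothetical names chosen for illustration. Each node increments its own heartbeat counter, sends its table to one randomly chosen peer, and suspects any member whose counter has not advanced within `t_fail` local ticks.

```python
import random


class GossipNode:
    """Illustrative gossip participant (hypothetical; not taken from GEMS)."""

    def __init__(self, name, members, t_fail=3):
        self.name = name
        self.t_fail = t_fail  # assumed timeout, in local ticks
        self.clock = 0        # local logical time
        # Highest heartbeat seen per member, and the local time it last grew.
        self.heartbeat = {m: 0 for m in members}
        self.last_update = {m: 0 for m in members}

    def tick(self):
        """Advance local time and bump this node's own heartbeat."""
        self.clock += 1
        self.heartbeat[self.name] += 1
        self.last_update[self.name] = self.clock

    def merge(self, remote_heartbeat):
        """Keep the larger counter for every member in a received table."""
        for member, beat in remote_heartbeat.items():
            if beat > self.heartbeat.get(member, -1):
                self.heartbeat[member] = beat
                self.last_update[member] = self.clock

    def suspected(self):
        """Members whose heartbeat has been stale for more than t_fail ticks."""
        return sorted(
            m for m, t in self.last_update.items()
            if m != self.name and self.clock - t > self.t_fail
        )


def gossip_round(nodes, rng):
    """One round: every node ticks, then gossips to one random peer."""
    for node in nodes.values():
        node.tick()
    for node in nodes.values():
        peer = rng.choice([n for n in nodes if n != node.name])
        nodes[peer].merge(node.heartbeat)


# Deterministic walk-through: "b" gossips to "a" every tick, "c" stays silent.
a = GossipNode("a", ["a", "b", "c"])
b = GossipNode("b", ["a", "b", "c"])
for _ in range(5):
    a.tick()
    b.tick()
    a.merge(b.heartbeat)
print(a.suspected())  # ['c']
```

Because `merge` also accepts members it has not seen before, a table received from a newly joined node adds that node to the local view, which loosely mirrors the dynamic-insertion feature the abstract mentions. A production service would additionally need real timers, network transport, and agreement among nodes before declaring a failure.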