Scalability of the Microsoft Cluster Service

Authors:
Werner Vogels;Dan Dumitriu;Ashutosh Agrawal;Teck Chia;Katherine Guo
Affiliations:
Department of Computer Science, Cornell University;Department of Computer Science, Cornell University;Department of Computer Science, Cornell University;Department of Computer Science, Cornell University;Department of Computer Science, Cornell University
Venue:
WINSYM'98 Proceedings of the 2nd conference on USENIX Windows NT Symposium - Volume 2
Year:
1998

Citing 7

Totem: a fault-tolerant multicast group communication system. Communications of the ACM
Building secure and reliable network applications
Building adaptive systems using Ensemble. Software—Practice & Experience - Special issue on multiprocessor operating systems
Six misconceptions about reliable distributed computing. Proceedings of the 8th ACM SIGOPS European workshop on Support for composing distributed applications
World wide failures. EW 7 Proceedings of the 7th workshop on ACM SIGOPS European workshop: Systems support for worldwide applications
The Design and Architecture of the Microsoft Cluster Service - A Practical Approach to High-Availability and Scalability. FTCS '98 Proceedings of the Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
A gossip-style failure detection service. Middleware '98 Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing

Cited 9

Fast cluster failover using virtual memory-mapped communication. ICS '99 Proceedings of the 13th international conference on Supercomputing
Evolution of the Virtual Interface Architecture. Computer
Rx: treating bugs as allergies---a safe method to survive software failures. Proceedings of the twentieth ACM symposium on Operating systems principles
GEMS: Gossip-Enabled Monitoring Service for Scalable Heterogeneous Distributed Systems. Cluster Computing
Treating bugs as allergies: a safe method for surviving software failures. HOTOS'05 Proceedings of the 10th conference on Hot Topics in Operating Systems - Volume 10
Sweeper: a lightweight end-to-end system for defending against fast worms. Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Rx: Treating bugs as allergies—a safe method to survive software failures. ACM Transactions on Computer Systems (TOCS)
CassMail: a scalable, highly-available, and rapidly-prototyped e-mail service. Proceedings of the 11th IFIP WG 6.1 international conference on Distributed applications and interoperable systems
LibRe: a consistency protocol for modern storage systems. Proceedings of the 6th ACM India Computing Convention

Abstract

An important argument for the introduction of software-managed clusters is that of scale: by constructing the cluster out of commodity compute elements, one can, simply by adding new elements, improve the reliability of the overall system in terms of both performance and availability. The limits to how far such a cluster can be scaled seem to depend on the scalability of its management software, which at its core has a collection of distributed algorithms that guarantee the correct operation of the cluster. The complexity of these algorithms makes them a vulnerable component in terms of their impact on the overall scalability of the system. This paper examines the two distributed components of the Microsoft Cluster Service (MSCS) [8] that are most likely to affect its scalability: the membership manager and the global update manager. The first sections of the paper provide general background on these distributed services and on scalability issues. The algorithms used to implement these services are then described in detail, and an analysis of their impact on scalability is given. The scalability analysis is based on an off-line analysis of the algorithms as well as the results of on-line experiments on a cluster with what is, in MSCS terms, a large number of nodes.