The impact of noise on the scaling of collectives: the nearest neighbor model

Authors:
Nisheeth K. Vishnoi
Affiliations:
University of California, Berkeley, CA
Venue:
HiPC'07 Proceedings of the 14th international conference on High performance computing
Year:
2007

Citing 8
Cited 0

Highly efficient gang scheduling implementation
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing

STORM: lightning-fast resource management
Proceedings of the 2002 ACM/IEEE conference on Supercomputing

Flexible CoScheduling: Mitigating Load Imbalance and Improving Utilization of Heterogeneous Resources
IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing

Improving the Scalability of Parallel Jobs by adding Parallel Awareness to the Operating System
Proceedings of the 2003 ACM/IEEE conference on Supercomputing

The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q
Proceedings of the 2003 ACM/IEEE conference on Supercomputing

A performance analysis of local synchronization
Proceedings of the eighteenth annual ACM symposium on Parallelism in algorithms and architectures

Impact of noise on scaling of collectives: an empirical evaluation
HiPC'06 Proceedings of the 13th international conference on High Performance Computing

The impact of noise on the scaling of collectives: a theoretical approach
HiPC'05 Proceedings of the 12th international conference on High Performance Computing

Abstract

This paper presents a theoretical study of the impact of noise on the scaling of a cluster when the processors participate in "local" collectives with their nearest neighbors. The model considered here extends the one introduced in [9] for understanding the effect of noise on the scaling of "global" collectives in large clusters. The scaling is studied with respect to three fundamental aspects: (1) the distribution of the noise: whether it is heavy-tailed or light-tailed; (2) the temporal independence of the noise; (3) the topology of the cluster. When the noise has a "light" tail and is temporally independent, it is shown that the cluster scales well, i.e., the slowdown per phase is just proportional to the (logarithm of the) maximum degree of the communication topology. This implies that for popular topologies such as grids and tori the slowdown per phase is just a constant factor, independent of the number of processors. In the light-tailed case, assuming only a weak temporal independence, a general upper bound is derived in terms of an "expansion" parameter of the communication topology. For grid-like graphs this establishes an exponential speedup compared to what was shown for global collective operations in [9].
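
The setting described in the abstract can be explored numerically with a small simulation. The sketch below is an illustration only and is not the paper's model: it assumes a ring topology (maximum degree 2), a fixed amount of work per phase, and independent exponentially distributed noise as one example of a light-tailed distribution. It also assumes that a processor starts a phase only once it and both of its neighbors have finished the previous one. The function name `simulate_ring` and all of its parameters are hypothetical.

```python
import random


def simulate_ring(num_procs, num_phases, work=1.0, noise_mean=0.1, seed=0):
    """Return the slowdown per phase relative to a noiseless run.

    Assumed model (not taken from the paper): processors sit on a ring,
    and in every phase each processor waits for itself and its two
    nearest neighbors, then spends `work` plus a random noise delay.
    """
    rng = random.Random(seed)
    # finish[i] = time at which processor i completed the previous phase
    finish = [0.0] * num_procs
    for _ in range(num_phases):
        next_finish = []
        for i in range(num_procs):
            left = finish[(i - 1) % num_procs]
            right = finish[(i + 1) % num_procs]
            # Local collective: start when self and both neighbors are done.
            start = max(finish[i], left, right)
            # Exponential noise with mean `noise_mean` (light-tailed, i.i.d.).
            noise = rng.expovariate(1.0 / noise_mean)
            next_finish.append(start + work + noise)
        finish = next_finish
    # A noiseless run takes num_phases * work in total.
    return max(finish) / (num_phases * work)


if __name__ == "__main__":
    for n in (16, 256, 4096):
        print(n, simulate_ring(n, num_phases=200))
```

Running the script prints the measured slowdown for a few ring sizes. Under the assumptions above, comparing these values gives a rough feel for how the per-phase slowdown behaves as the number of processors grows; it does not verify the bounds stated in the abstract.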