Runtime optimization of vector operations on large scale SMP clusters

Authors:
Costin Iancu;Steven Hofmeyr
Affiliations:
Lawrence Berkeley National Laboratory, Berkeley, CA, USA;Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Venue:
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Year:
2008

Citing 16
Cited 0

Active messages: a mechanism for integrated communication and computation

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Global arrays: a nonuniform memory access programming model for high-performance computers

The Journal of Supercomputing
LogGP: incorporating long messages into the LogP model for parallel computation

Journal of Parallel and Distributed Computing
Communication Characteristics of Large-Scale Scientific Applications for Contemporary Cluster Architectures

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
ARMCI: A Portable Remote Memory Copy Libray for Ditributed Array Libraries and Compiler Run-Time Systems

Proceedings of the 11 IPPS/SPDP'99 Workshops Held in Conjunction with the 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing
GASNet Specification, v1.1

GASNet Specification, v1.1
A pipelined schedule to minimize completion time for loop tiling with computation and communication overlapping

Journal of Parallel and Distributed Computing
A Multi-Platform Co-Array Fortran Compiler

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Automatic Support for Irregular Computations in a High-Level Language

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
A framework for adaptive algorithm selection in STAPL

Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
Optimizing Strided Remote Memory Access Operations on the Quadrics QsNetII Network Interconnect

HPCASIA '05 Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region
Optimizing communication overlap for high-speed networks

Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
$\log_{\rm n}{\rm P}$ and $\log_{3}{\rm P}$: Accurate Analytical Models of Point-to-Point Communication in Distributed Systems

IEEE Transactions on Computers
Evaluating NIC hardware requirements to achieve high message rate PGAS support on multi-core processors

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Performance portable optimizations for loops containing communication operations

Proceedings of the 22nd annual international conference on Supercomputing
Optimizing bandwidth limited problems using one-sided communication and overlap

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

"Vector" style communication operations transfer multiple disjoint memory regions within one logical step. These operations are widely used in applications, they do improve application performance, and their behavior has been studied and optimized using different implementation techniques across a large variety of systems. In this paper we present a methodology for the selection of the best performing implementation of a vector operation from multiple alternative implementations. Our approach is designed to work for systems with wide SMP nodes where we believe that most published studies fail to correctly predict performance. Due to the emergence of multi-core processors we believe that techniques similar to ours will be incorporated for performance reasons in communication libraries or language runtimes. The methodology relies on the exploration of the application space and a classification of the regions within this space where a particular implementation method performs best. We use micro-benchmarks to measure the performance of an implementation for a given point in the application space and then compose profiles that compare the performance of two given implementations. These profiles capture an empirical upper bound for the performance degradation of a given protocol under heavy node load. At runtime, the application selects the implementation according to these performance profiles. Our approach provides performance portability and using our dynamic multi-protocol selection we have been able to improve the performance of a NAS Parallel Benchmarks workload by 22% on an IBM large scale cluster. Very positive results have also been obtained on large scale InfiniBand and Cray XT systems. This work indicates that perhaps the most important factor for application performance on wide SMP systems is the successful management of load on the Network Interface Cards.