Performance and Scalability Analysis of Cray X1 Vectorization and Multistreaming Optimization

Authors:
Sadaf Alam;Jeffrey Vetter
Affiliations:
Computer Science and Mathematics Division, Oak Ridge National Laboratory;Computer Science and Mathematics Division, Oak Ridge National Laboratory
Venue:
ICCS'05 Proceedings of the 5th international conference on Computational Science - Volume Part I
Year:
2005

Abstract

The Cray X1 Fortran and C/C++ compilers provide a number of loop transformations, notably vectorization and multistreaming, to exploit the hardware resources and high memory bandwidth of the multistreaming processor (MSP). A Cray X1 node is composed of four MSPs, each of which is in turn composed of four single-streaming processors (SSPs). Each SSP contains a superscalar processing unit and two vector processing units. Compiler vectorization provides loop-level parallelization and uses the vector processing hardware. Multistreaming code generation by the compiler permits a block of code to execute across the SSPs of an MSP. In this paper, we analyze the overall impact of loop-level compiler optimization on a scientific application, the Parallel Ocean Program (POP). POP has been extensively optimized for the X1 by instrumenting the code with X1 compiler directives. We compare and contrast the automatic and manual optimization schemes available on the X1 and analyze their impact on code performance and scalability. Our results show that adding compiler directives increases the average vector length, thereby significantly improving single-node performance. However, the resulting code scales at a slower rate as the local workload volume decreases and the communication costs increase.
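
The two loop transformations named in the abstract can be pictured with a small toy model. The sketch below is illustrative only and is not taken from the paper: it splits the outer iterations of a loop nest into four contiguous blocks (one per SSP of an MSP, as stated in the abstract) and strip-mines the inner loop into vector operations. The maximum vector length of 64 elements, the function names, and the contiguous-block partitioning are all assumptions made for the example, and no Cray compiler directive or measurement is reproduced here.

```python
# Toy model of multistreaming (outer loop) and vectorization (inner loop).
# Assumptions for illustration: 64-element vector strips, contiguous
# partitioning of outer iterations. Only the 4-SSPs-per-MSP figure comes
# from the abstract.

SSPS_PER_MSP = 4          # from the abstract
MAX_VECTOR_LENGTH = 64    # assumed strip length, not stated in the abstract


def stream_chunks(n_outer, n_streams=SSPS_PER_MSP):
    """Split n_outer iterations into contiguous blocks, one per stream."""
    base, extra = divmod(n_outer, n_streams)
    chunks = []
    start = 0
    for s in range(n_streams):
        # The first `extra` streams take one additional iteration.
        size = base + (1 if s < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def vector_strips(n_inner, max_vl=MAX_VECTOR_LENGTH):
    """Strip-mine an inner loop; return the length of each vector operation."""
    full, rem = divmod(n_inner, max_vl)
    return [max_vl] * full + ([rem] if rem else [])


def average_vector_length(n_inner, max_vl=MAX_VECTOR_LENGTH):
    """Mean number of elements per vector operation for one inner loop."""
    strips = vector_strips(n_inner, max_vl)
    return sum(strips) / len(strips) if strips else 0.0


if __name__ == "__main__":
    # Hypothetical inner trip counts, standing in for a shrinking local block.
    for n_inner in (192, 100, 24):
        print(n_inner, vector_strips(n_inner), average_vector_length(n_inner))
    print([len(c) for c in stream_chunks(10)])
```

In this model, an inner loop of 192 iterations fills every strip, while one of 24 iterations issues a single short vector operation. This is one simple way to see why average vector length depends on the amount of work available in a vectorized loop. Whether this effect contributes to the scaling behavior reported in the paper is not something the sketch can establish.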