Optimizing matrix transposes using a POWER7 cache model and explicit prefetching

Authors:
Gabriel Mateescu;Gregory H. Bauer;Robert A. Fiedler
Affiliations:
Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland;National Center for Supercomputing Applications, Urbana, IL, USA;National Center for Supercomputing Applications, Urbana, IL, USA
Venue:
ACM SIGMETRICS Performance Evaluation Review
Year:
2012

Abstract

We consider the problem of efficiently computing matrix transposes on the POWER7 architecture. We develop a matrix transpose algorithm that uses cache blocking, cache prefetching, and data alignment. We model the POWER7 data cache and memory concurrency and use the model to predict the memory throughput of the proposed algorithm. The performance of our algorithm is up to five times higher than that of the dgetmo routine of the Engineering and Scientific Subroutine Library (ESSL) and 2.5 times higher than that of code generated with compiler-inserted prefetching. Numerical experiments show good agreement between the predicted and the measured memory throughput.
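The three techniques named in the abstract (cache blocking, explicit prefetching, and data alignment) can be illustrated with a minimal sketch in C. This is not the authors' algorithm. The tile edge `TILE`, the helper names `alloc_matrix` and `transpose_blocked`, and the prefetch placement are all hypothetical choices made for illustration. The sketch assumes a GCC- or Clang-compatible compiler (for `__builtin_prefetch`), a C11 library (for `aligned_alloc`), and row-major storage. The 128-byte value matches the POWER7 cache line size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TILE 32         /* hypothetical tile edge; tune for the target cache */
#define CACHE_LINE 128  /* POWER7 cache line size in bytes */

/* Allocate rows*cols doubles starting on a cache-line boundary.
 * Release with free(). Returns NULL on failure. */
static double *alloc_matrix(size_t rows, size_t cols)
{
    size_t bytes = rows * cols * sizeof(double);
    /* aligned_alloc expects the size to be a multiple of the alignment. */
    size_t padded = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    return aligned_alloc(CACHE_LINE, padded);
}

/* Out-of-place transpose: b (cols x rows) = transpose of a (rows x cols).
 * Both matrices are row-major. The loops walk TILE x TILE blocks so that
 * the source and destination tiles can stay cache-resident together. */
static void transpose_blocked(const double *a, double *b,
                              size_t rows, size_t cols)
{
    assert(a != b); /* in-place transposition is not supported here */

    for (size_t ii = 0; ii < rows; ii += TILE) {
        size_t i_end = (ii + TILE < rows) ? ii + TILE : rows;
        for (size_t jj = 0; jj < cols; jj += TILE) {
            size_t j_end = (jj + TILE < cols) ? jj + TILE : cols;
            for (size_t i = ii; i < i_end; i++) {
                /* Hint only: request the next source row of this tile
                 * for reading (0) with high temporal locality (3). */
                if (i + 1 < i_end)
                    __builtin_prefetch(&a[(i + 1) * cols + jj], 0, 3);
                for (size_t j = jj; j < j_end; j++)
                    b[j * rows + i] = a[i * cols + j];
            }
        }
    }
}
```

A caller would allocate both matrices with `alloc_matrix`, fill the source, and call `transpose_blocked(a, b, rows, cols)`. The prefetch is only a hint, so the result is identical with or without it. Whether it helps depends on the tile size and on how the hardware prefetcher already behaves, which is the kind of question a cache model like the one described above is meant to answer.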