A Performance Model of Dense Matrix Operations on Many-Core Architectures

Authors:
Guoping Long;Dongrui Fan;Junchao Zhang;Fenglong Song;Nan Yuan;Wei Lin
Affiliations:
Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China 100080 (all authors)
Venue:
Euro-Par '08: Proceedings of the 14th International Euro-Par Conference on Parallel Processing
Year:
2008

Abstract

Current many-core architectures (MCA) have a much larger arithmetic-to-memory-bandwidth ratio than traditional processors (vector, superscalar, multi-core, etc.). As a result, memory bandwidth has become an important performance bottleneck of MCA. Previous work has demonstrated promising performance of MCA for dense matrix operations. However, there is still little quantitative understanding of the relationship between the performance of matrix computation kernels and the limited memory bandwidth. This paper presents a performance model for dense matrix multiplication (MM), LU decomposition, and Cholesky decomposition. The input parameters are the memory bandwidth $B$ and the on-chip SRAM capacity $C$, while the output is the maximum core number $P_{max}$. We show that $P_{max}=\Theta(B\ast \sqrt{C})$. $P_{max}$ indicates that, when the problem size is large enough, the given memory bandwidth will not be a performance bottleneck as long as the number of cores $P \le P_{max}$. The model is validated by comparing its theoretical predictions with experimental data from previous works.
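
To make the $\Theta(B\ast \sqrt{C})$ scaling concrete, here is a minimal sketch based on a textbook blocking argument for matrix multiplication. It is not the paper's own derivation, and the constants are illustrative. The function name `max_cores`, the per-core rate `flops_per_core_per_s`, and the assumption that three square tiles share the whole on-chip SRAM are all hypothetical choices made for this example.

```python
import math


def max_cores(bandwidth_words_per_s, sram_words, flops_per_core_per_s):
    """Rough upper bound on cores that a given memory bandwidth can feed.

    Assumptions (illustrative, not taken from the paper):
      * three square b x b tiles (one each of A, B and the result) fit in SRAM,
        so b = floor(sqrt(sram_words / 3));
      * one tile product costs 2*b**3 flops and loads 2*b**2 words,
        because the result tile stays resident on chip.
    """
    b = math.isqrt(sram_words // 3)              # tile edge, grows like sqrt(C)
    flops_per_word = b                           # 2*b**3 flops / 2*b**2 words
    sustainable_flops = bandwidth_words_per_s * flops_per_word
    return sustainable_flops // flops_per_core_per_s


if __name__ == "__main__":
    base = max_cores(10**9, 3 * 256 * 256, 10**9)
    more_sram = max_cores(10**9, 4 * 3 * 256 * 256, 10**9)
    more_bandwidth = max_cores(2 * 10**9, 3 * 256 * 256, 10**9)
    print(base, more_sram, more_bandwidth)
```

Under these assumptions, quadrupling the SRAM capacity or doubling the bandwidth each doubles the bound, which is the behaviour expected from a bound proportional to $B\ast \sqrt{C}$.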