A Performance Model of Dense Matrix Operations on Many-Core Architectures

  • Authors:
  • Guoping Long; Dongrui Fan; Junchao Zhang; Fenglong Song; Nan Yuan; Wei Lin

  • Affiliations:
  • Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China 100080 (all six authors)

  • Venue:
  • Euro-Par '08: Proceedings of the 14th International Euro-Par Conference on Parallel Processing
  • Year:
  • 2008

Abstract

Current many-core architectures (MCA) have a much larger arithmetic-to-memory-bandwidth ratio than traditional processors (vector, superscalar, multi-core, etc.). As a result, memory bandwidth has become an important performance bottleneck of MCA. Previous work has demonstrated promising performance of MCA for dense matrix operations. However, there is still little quantitative understanding of the relationship between the performance of matrix computation kernels and the limited memory bandwidth. This paper presents a performance model for dense matrix multiplication (MM), LU decomposition, and Cholesky decomposition. The input parameters are the memory bandwidth $B$ and the on-chip SRAM capacity $C$, while the output is the maximum core number $P_{max}$. We show that $P_{max}=\Theta(B \cdot \sqrt{C})$. $P_{max}$ indicates that when the problem size is large enough, the given memory bandwidth will not be a performance bottleneck as long as the number of cores $P \le P_{max}$. The model is validated by comparing its theoretical performance predictions with experimental data from previous works.
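The scaling relation stated in the abstract can be illustrated with a minimal Python sketch. It is not the authors' model: the function name `p_max_estimate` and the constant `k` are hypothetical, and because the Θ notation hides the real constant factor, only the ratios between results are meaningful, not the absolute values. The units of bandwidth and capacity are likewise arbitrary here.

```python
import math


def p_max_estimate(bandwidth, sram_capacity, k=1.0):
    """Illustrative estimate k * B * sqrt(C).

    k is a hypothetical constant standing in for the factor hidden by
    the Theta notation; the abstract does not give its value.
    """
    if bandwidth < 0 or sram_capacity < 0:
        raise ValueError("bandwidth and sram_capacity must be non-negative")
    return k * bandwidth * math.sqrt(sram_capacity)


# Arbitrary example values, chosen only to show how the bound scales.
base = p_max_estimate(bandwidth=8.0, sram_capacity=1024.0)
double_b = p_max_estimate(bandwidth=16.0, sram_capacity=1024.0)
quad_c = p_max_estimate(bandwidth=8.0, sram_capacity=4096.0)

print(double_b / base)  # doubling B doubles the estimate: 2.0
print(quad_c / base)    # quadrupling C also only doubles it: 2.0
```

Under this relation, raising the supportable core count by a given factor requires either the same factor of extra bandwidth or the square of that factor in extra on-chip SRAM.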