Current many-core architectures (MCA) have a much larger ratio of arithmetic throughput to memory bandwidth than traditional processors (vector, superscalar, multi-core, etc.). As a result, memory bandwidth has become an important performance bottleneck for MCA. Previous work has demonstrated promising performance of MCA for dense matrix operations. However, there is still little quantitative understanding of the relationship between the performance of matrix computation kernels and the limited memory bandwidth. This paper presents a performance model for dense matrix multiplication (MM), LU decomposition, and Cholesky decomposition. The input parameters are the memory bandwidth $B$ and the on-chip SRAM capacity $C$; the output is the maximum core number $P_{max}$. We show that $P_{max}=\Theta(B\ast \sqrt{C})$. $P_{max}$ indicates that, when the problem size is large enough, the given memory bandwidth will not be a performance bottleneck as long as the number of cores $P \le P_{max}$. The model is validated by comparing its theoretical predictions with experimental data from previous work.
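To give a feel for why the bound scales with bandwidth times the square root of on-chip capacity, the following is a minimal sketch based on the standard blocked matrix-multiplication argument. It is not the paper's model: the function name `max_cores`, the "three square tiles fit in SRAM" layout, the per-core flop rate parameter, and all constant factors are assumptions chosen only for illustration.

```python
import math


def max_cores(bandwidth_words_per_s, sram_words, flops_per_core_per_s):
    """Illustrative estimate of P_max for blocked matrix multiplication.

    Hypothetical helper: the tiling scheme and constants are assumptions,
    not taken from the paper.
    """
    # Assumption: three b x b tiles (one each of A, B, and C) share the SRAM.
    b = math.isqrt(sram_words // 3)
    # Assumption: the C tile stays resident, so one tile update performs
    # 2*b^3 flops while loading 2*b^2 words (an A tile and a B tile).
    flops_per_word = (2 * b**3) / (2 * b**2)  # simplifies to b, i.e. ~sqrt(C)
    # Bandwidth keeps up while P * flops_per_core / flops_per_word <= B.
    return math.floor(bandwidth_words_per_s * flops_per_word / flops_per_core_per_s)


base = max_cores(1e9, 3 * 256**2, 1e9)
more_bandwidth = max_cores(2e9, 3 * 256**2, 1e9)
more_sram = max_cores(1e9, 3 * 512**2, 1e9)
print(base, more_bandwidth, more_sram)
```

Under these assumptions, doubling the bandwidth doubles the estimate, while the SRAM capacity must be quadrupled to achieve the same effect, which is the behavior $\Theta(B\ast \sqrt{C})$ describes.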