High-Performance Matrix Multiply on a Massively Multithreaded Fiteng1000 Processor

Authors:
Jie Liu; Lihua Chi; Chunye Gong; Han Xu; Jie Jiang; Yihui Yan; Qingfeng Hu
Affiliations:
Section 605, College of Computer Science, National University of Defense Technology, Changsha, China (all authors)
Venue:
ICA3PP'12: Proceedings of the 12th International Conference on Algorithms and Architectures for Parallel Processing, Volume Part II
Year:
2012

Abstract

Matrix multiplication is an essential building block of many linear algebra operations and applications. This paper presents parallel matrix multiplication algorithms for the massively multithreaded Fiteng1000 processor, with either matrix A or matrix B shared in memory. We discuss the implementation of these algorithms on this multi-core processor with many threads. To obtain good performance, it is important to choose the two-dimensional (2D) spatial topology of the threads, the memory level in which the matrices are placed, and the sizes of the matrices. The parallel codes are written in C and assembly language under the OpenMP parallel programming environment. Performance results on the Fiteng1000 processor show that the algorithms achieve good parallel performance and near-peak performance.
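The abstract does not spell out the algorithms, so the following is only a rough illustration of the general idea of mapping C = A * B onto a 2D grid of threads: A and B are read by all threads as shared data, and each cell of a hypothetical PR x PC thread grid computes and writes its own block of C. The function name gemm_2d_grid and the parameters PR and PC are assumptions made for this sketch. It does not model the Fiteng1000 memory levels or the assembly kernels used in the paper, and the naive inner loops stand in for a tuned kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch (not the paper's code): C = A * B for row-major
 * matrices A (m x k), B (k x n) and C (m x n).
 * The m x n output is split into a PR x PC logical grid of blocks.
 * Each grid cell (pr, pc) is one OpenMP iteration: it reads the shared,
 * read-only A and B and writes only its own block of C, so no locking
 * is needed. Built without OpenMP, the pragma is skipped and the loops
 * run serially with the same result. */
void gemm_2d_grid(size_t m, size_t n, size_t k,
                  const double *A, const double *B, double *C,
                  int PR, int PC)
{
    assert(PR > 0 && PC > 0);

    /* Ceiling division: the last block row/column absorbs any remainder. */
    size_t mb = (m + (size_t)PR - 1) / (size_t)PR;
    size_t nb = (n + (size_t)PC - 1) / (size_t)PC;

#ifdef _OPENMP
    #pragma omp parallel for collapse(2) schedule(static)
#endif
    for (int pr = 0; pr < PR; ++pr) {
        for (int pc = 0; pc < PC; ++pc) {
            /* Bounds of this grid cell's block of C (may be empty). */
            size_t i0 = (size_t)pr * mb;
            size_t j0 = (size_t)pc * nb;
            size_t i1 = (i0 + mb < m) ? i0 + mb : m;
            size_t j1 = (j0 + nb < n) ? j0 + nb : n;

            /* Naive block kernel; a real code would use a tuned kernel. */
            for (size_t i = i0; i < i1; ++i) {
                for (size_t j = j0; j < j1; ++j) {
                    double acc = 0.0;
                    for (size_t p = 0; p < k; ++p)
                        acc += A[i * k + p] * B[p * n + j];
                    C[i * n + j] = acc;
                }
            }
        }
    }
}
```

For example, calling gemm_2d_grid(2, 2, 3, A, B, C, 2, 2) with A = [[1, 2, 3], [4, 5, 6]] and B = [[7, 8], [9, 10], [11, 12]] splits the 2 x 2 result into four 1 x 1 blocks and yields C = [[58, 64], [139, 154]]. Choosing a different PR x PC shape changes only how the work is divided among threads, not the result. This is the kind of choice the abstract identifies as important for performance.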