High Performance Matrix Multiplication on Many Cores

Authors:
Nan Yuan;Yongbin Zhou;Guangming Tan;Junchao Zhang;Dongrui Fan
Affiliations:
Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P. R. China 100190 and Graduate University of Chinese Academy of Scienc ...;Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P. R. China 100190 and Graduate University of Chinese Academy of Scienc ...;Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P. R. China 100190;Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P. R. China 100190;Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P. R. China 100190
Venue:
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Year:
2009

Abstract

Moore's Law suggests that the number of processing cores on a single chip increases exponentially. Future performance gains will come mainly from thread-level parallelism exploited by multi/many-core processors (MCPs). It is therefore necessary to determine how to build MCP hardware and how to program parallelism on it. In this work, we aim to identify the key architectural mechanisms and software optimizations needed to guarantee high performance for multithreaded programs. As a case study, we customize a dense matrix multiplication algorithm for the Godson-T MCP to demonstrate the synergy and interaction between hardware and software. Experiments on a cycle-accurate simulator show that the optimized matrix multiplication achieves 97.1% (124.3 GFLOPS) of Godson-T's peak performance.
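The abstract does not spell out the algorithm, so the following is only a minimal, generic sketch of tiled (blocked) dense matrix multiplication, the kind of kernel such a case study typically starts from. It is not the authors' Godson-T implementation. The function names, the `tile` parameter, and the pure-Python list representation are assumptions chosen for illustration. The second helper only restates the arithmetic behind the reported figures: 124.3 GFLOPS at 97.1% of peak implies a peak of roughly 128 GFLOPS.

```python
def blocked_matmul(a, b, tile=2):
    """Compute C = A * B tile by tile (illustrative, not the paper's code).

    `a` is an n x k list of lists and `b` is a k x m list of lists.
    `tile` is a hypothetical block size; on real hardware it would be
    chosen so that the working set of one block fits in fast local memory.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # Outer three loops walk over blocks of C, and over the shared dimension.
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Inner three loops multiply one block of A by one block of B.
                for i in range(i0, min(i0 + tile, n)):
                    for kk in range(k0, min(k0 + tile, k)):
                        a_ik = a[i][kk]
                        for j in range(j0, min(j0 + tile, m)):
                            c[i][j] += a_ik * b[kk][j]
    return c


def implied_peak_gflops(achieved_gflops, fraction_of_peak):
    """Back out the peak rate from an achieved rate and its share of peak."""
    return achieved_gflops / fraction_of_peak


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(blocked_matmul(a, b, tile=2))
    # Figures from the abstract: 124.3 GFLOPS at 97.1% of peak.
    print(round(implied_peak_gflops(124.3, 0.971), 1))
```

In a many-core setting, independent blocks of C could be assigned to different threads, since each block of the result is written by only one set of loop iterations. How the paper actually partitions the work is not stated in the abstract.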