Optimized dense matrix multiplication on a many-core architecture

Authors:
Elkin Garcia; Ioannis E. Venetis; Rishi Khan; Guang R. Gao
Affiliations:
Computer Architecture and Parallel Systems Laboratory, Department of Electrical and Computer Engineering, University of Delaware, Newark; Department of Computer Engineering and Informatics, University of Patras, Rion, Greece; ET International, Newark; Computer Architecture and Parallel Systems Laboratory, Department of Electrical and Computer Engineering, University of Delaware, Newark
Venue:
Euro-Par'10: Proceedings of the 16th International Euro-Par Conference on Parallel Processing, Part II
Year:
2010

Abstract

Traditional parallel programming methodologies for improving performance assume cache-based parallel systems. However, new architectures such as the IBM Cyclops-64 (C64) belong to a new class of many-core-on-a-chip systems with a software-managed memory hierarchy. New programming and compiling methodologies are required to fully exploit the potential of this class of architectures. In this paper, we use dense matrix multiplication as a case study to present a general methodology for mapping applications to these kinds of architectures. Our methodology has the following characteristics: (1) Balanced distribution of work among threads to fully exploit available resources. (2) Optimal register tiling and tile traversal order, calculated analytically and parametrized by the register file size of the processor used, which results in minimal memory transfers and optimal register usage. (3) Implementation of architecture-specific optimizations to further increase performance. Our experimental evaluation on a real C64 chip shows a performance of 44.12 GFLOPS, which corresponds to 55.2% of the peak performance of the chip. Additionally, power consumption measurements show that the C64 is very power efficient, providing 530 MFLOPS/W for the problem under consideration.
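
As a rough illustration of characteristics (1) and (2), the sketch below shows one common way to reason about them. It is not the paper's actual analysis, which the abstract does not give. The register-cost model (an m x n tile of C kept in registers, plus m values of A and n values of B per inner-product step) and the function names `best_register_tile` and `balanced_split` are assumptions made for this example, and the register count passed in is a free parameter rather than a stated property of the C64.

```python
def best_register_tile(num_registers):
    """Pick an m x n register tile that minimizes loads per flop.

    Assumed cost model (not taken from the paper): each inner-product
    step keeps the m*n tile of C in registers, loads m values of A and
    n values of B, and performs 2*m*n floating-point operations.
    """
    best = None
    for m in range(1, num_registers + 1):
        for n in range(1, num_registers + 1):
            regs = m * n + m + n
            if regs > num_registers:
                break  # larger n only needs more registers
            loads_per_flop = (m + n) / (2 * m * n)
            candidate = (loads_per_flop, regs, m, n)
            if best is None or candidate < best:
                best = candidate
    if best is None:
        raise ValueError("need at least 3 registers for a 1 x 1 tile")
    loads_per_flop, regs, m, n = best
    return m, n, regs, loads_per_flop


def balanced_split(num_tiles, num_threads):
    """Distribute tiles so that thread loads differ by at most one."""
    base, extra = divmod(num_tiles, num_threads)
    return [base + 1 if t < extra else base for t in range(num_threads)]


if __name__ == "__main__":
    m, n, regs, ratio = best_register_tile(64)
    print(f"tile {m} x {n} uses {regs} registers, {ratio:.4f} loads/flop")
    print(balanced_split(10, 4))
```

Under this model, square or near-square tiles win because the number of loads grows with m + n while the work grows with m * n. A real implementation would also reserve registers for addresses and loop counters, which shrinks the usable tile.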