Matrix multiplication via arithmetic progressions
STOC '87 Proceedings of the nineteenth annual ACM symposium on Theory of computing
Exploiting fast matrix multiplication within the level 3 BLAS
ACM Transactions on Mathematical Software (TOMS)
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Data cache performance of supercomputer applications
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
Using Strassen's algorithm to accelerate the solution of linear systems
The Journal of Supercomputing
Tile size selection using cache organization and data layout
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Generalized Cannon's algorithm for parallel matrix multiplication
ICS '97 Proceedings of the 11th international conference on Supercomputing
Introduction to Algorithms
A cellular computer to implement the kalman filter algorithm
A cellular computer to implement the kalman filter algorithm
Mapping the LU decomposition on a many-core architecture: challenges and solutions
Proceedings of the 6th ACM conference on Computing frontiers
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Mapping the FDTD Application to Many-Core Chip Architectures
ICPP '09 Proceedings of the 2009 International Conference on Parallel Processing
Optimization of dense matrix multiplication on IBM cyclops-64: challenges and experiences
Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
Locality optimization of stencil applications using data dependency graphs
LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Toward high-throughput algorithms on many-core architectures
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Proceedings of the 9th conference on Computing Frontiers
Strategies for improving performance and energy efficiency on a many-core
Proceedings of the ACM International Conference on Computing Frontiers
The Journal of Supercomputing
Hi-index | 0.00 |
Traditional parallel programming methodologies for improving performance assume cache-based parallel systems. However, new architectures, like the IBM Cyclops-64 (C64), belong to a new set of many-core-on-a-chip systems with a software managed memory hierarchy. New programming and compiling methodologies are required to fully exploit the potential of this new class of architectures. In this paper, we use dense matrix multiplication as a case of study to present a general methodology to map applications to these kinds of architectures. Our methodology exposes the following characteristics: (1) Balanced distribution of work among threads to fully exploit available resources. (2) Optimal register tiling and sequence of traversing tiles, calculated analytically and parametrized according to the register file size of the processor used. This results in minimal memory transfers and optimal register usage. (3) Implementation of architecture specific optimizations to further increase performance. Our experimental evaluation on a real C64 chip shows a performance of 44.12 GFLOPS, which corresponds to 55.2% of the peak performance of the chip. Additionally, measurements of power consumption prove that the C64 is very power efficient providing 530 MFLOPS/W for the problem under consideration.