High-performance optimizations on tiled many-core embedded systems: a matrix multiplication case study

Authors:
Arslan Munir;Farinaz Koushanfar;Ann Gordon-Ross;Sanjay Ranka
Affiliations:
Department of Electrical and Computer Engineering, Rice University, Houston, USA;Department of Electrical and Computer Engineering, Rice University, Houston, USA;Department of Electrical and Computer Engineering, University of Florida, Gainesville, USA and NSF Center for High-Performance Reconfigurable Computing (CHREC), University of Florida, Gainesville, ...;Department of Computer and Information Science and Engineering, University of Florida, Gainesville, USA
Venue:
The Journal of Supercomputing
Year:
2013

Abstract

Technological advancements in the silicon industry, as predicted by Moore's law, have resulted in an increasing number of processor cores on a single chip, giving rise to multicore and, subsequently, many-core architectures. This work focuses on identifying key architecture and software optimizations to attain high performance from tiled many-core architectures (TMAs), an architectural innovation in multicore technology. Although embedded systems design is traditionally power-centric, there has been a recent shift toward high-performance embedded computing due to the proliferation of compute-intensive embedded applications. TMAs are suitable for these embedded applications because many of them incorporate low-power design features. We discuss performance optimizations on a single tile (processor core) as well as parallel performance optimizations for TMAs, such as application decomposition, cache locality, tile locality, memory balancing, and horizontal communication. We elaborate on compiler-based optimizations that are applicable to TMAs, such as function inlining, loop unrolling, and feedback-based optimizations. We present a case study with optimized dense matrix multiplication algorithms for Tilera's TILEPro64 to experimentally demonstrate performance and performance-per-watt optimizations on TMAs. Our results quantify the effectiveness of algorithmic choices, cache blocking, compiler optimizations, and horizontal communication in attaining high performance and performance per watt on TMAs.
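
As a rough illustration of the cache blocking named in the abstract, the following minimal C sketch contrasts a plain triple-loop matrix multiplication with a blocked variant. It is not the authors' TILEPro64 implementation: the function names `matmul_naive` and `matmul_blocked`, the row-major `double` layout, and the block-size parameter `bs` are assumptions made for this example, and a suitable `bs` would have to be tuned to the cache of the target tile.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: C = A * B for n x n row-major matrices. */
void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}

/* Smaller of two sizes; used to clip blocks at the matrix edge. */
static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/*
 * Cache-blocked version (illustrative sketch). The iteration space is
 * split into bs x bs sub-blocks so that the pieces of A, B, and C being
 * combined can stay resident in cache while they are reused.
 * bs is a hypothetical tuning parameter, not a value from the paper.
 */
void matmul_blocked(size_t n, size_t bs,
                    const double *A, const double *B, double *C)
{
    assert(bs > 0);

    /* C accumulates partial products, so it must start at zero. */
    for (size_t idx = 0; idx < n * n; idx++)
        C[idx] = 0.0;

    for (size_t ii = 0; ii < n; ii += bs) {
        size_t i_end = min_sz(ii + bs, n);
        for (size_t kk = 0; kk < n; kk += bs) {
            size_t k_end = min_sz(kk + bs, n);
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t j_end = min_sz(jj + bs, n);

                /* Multiply one block of A by one block of B. */
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
                }
            }
        }
    }
}
```

Both functions compute the same product; the blocked form only reorders the work so that each sub-block is reused before it is evicted. On a tiled many-core device, each tile would typically be assigned a subset of the output blocks, which is where the application decomposition and tile locality mentioned in the abstract come in.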