High performance dense linear algebra on a spatially distributed processor

Authors:
Jeffrey R. Diamond;Behnam Robatmili;Stephen W. Keckler;Robert van de Geijn;Kazushige Goto;Doug Burger
Affiliations:
The University of Texas at Austin, Austin, TX, USA;The University of Texas at Austin, Austin, TX, USA;The University of Texas at Austin, Austin, TX, USA;The University of Texas at Austin, Austin, TX, USA;The University of Texas at Austin, Austin, TX, USA;The University of Texas at Austin, Austin, TX, USA
Venue:
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Year:
2008

Abstract

As technology trends have limited the performance scaling of conventional processors, industry and academic research have turned to parallel architectures on a single chip, including distributed uniprocessors and multicore chips. This paper examines how to extend the archetypal operation of dense linear algebra, matrix multiply, to an emerging class of uniprocessor architectures characterized by a large number of independent functional units, register banks, and cache banks connected by a 2-D on-chip network. We extend Goto's well-known algorithm for matrix multiplication to this spatially distributed class of uniprocessor and describe the optimizations of the innermost kernel, a systolic-like algorithm running on a general-purpose uniprocessor. The resulting implementation yields the first demonstration of high performance in an application executing on the TRIPS processor hardware, a next-generation distributed processor core. We show that such processors are indeed capable of substantial improvements in single-threaded performance, provided their spatial topography is taken into account.
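
The blocking idea that underlies high-performance matrix multiply can be illustrated with a minimal sketch. This is not the TRIPS kernel described in the paper, nor Goto's implementation (which additionally packs blocks into contiguous buffers and uses a hand-tuned inner kernel); it only shows how the three loops of C = A * B can be tiled so that each inner step works on a small tile. The function name `blocked_matmul` and the block sizes `mc`, `kc`, and `nc` are assumptions chosen for illustration, not values from the paper.

```python
def blocked_matmul(A, B, mc=2, kc=2, nc=2):
    """Compute C = A * B using loops tiled into blocks.

    A is m x k and B is k x n, both given as lists of lists.
    mc, kc, nc are hypothetical block sizes (illustrative only).
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for jc in range(0, n, nc):              # column panel of B and C
        for pc in range(0, k, kc):          # block along the shared dimension
            for ic in range(0, m, mc):      # row block of A and C
                # Inner kernel: accumulate into one (at most) mc x nc tile of C.
                for i in range(ic, min(ic + mc, m)):
                    for p in range(pc, min(pc + kc, k)):
                        a_ip = A[i][p]
                        for j in range(jc, min(jc + nc, n)):
                            C[i][j] += a_ip * B[p][j]
    return C


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    print(blocked_matmul(A, B))
```

The result is independent of the block sizes; in a tuned implementation they would be chosen so that each tile fits the storage level it is meant to occupy, and on a spatially distributed processor the inner kernel would also be laid out with the physical placement of functional units and register banks in mind.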