A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
LAPACK Users' guide (third ed.)
LAPACK Users' guide (third ed.)
Automatically tuned linear algebra software
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Introduction to VLSI Systems
A Family of High-Performance Matrix Multiplication Algorithms
ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
A Flexible Class of Parallel Matrix Multiplication Algorithms
IPPS '98 Proceedings of the 12th. International Parallel Processing Symposium on International Parallel Processing Symposium
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams
Proceedings of the 31st annual international symposium on Computer architecture
Compiling for EDGE Architectures
Proceedings of the International Symposium on Code Generation and Optimization
Distributed Microarchitectural Protocols in the TRIPS Prototype Processor
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Scheduling threads for constructive cache sharing on CMPs
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
The coming wave of multithreaded chip multiprocessors
International Journal of Parallel Programming
Anatomy of high-performance matrix multiplication
ACM Transactions on Mathematical Software (TOMS)
Toward scalable matrix multiply on multithreaded architectures
Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing
An evaluation of the TRIPS computer system
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
High Performance Matrix Multiplication on Many Cores
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
A programmable parallel accelerator for learning and classification
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
A Massively Parallel, Energy Efficient Programmable Accelerator for Learning and Classification
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 0.00 |
As technology trends have limited the performance scaling of conventional processors, industry and academic research has turned to parallel architectures on a single chip, including distributed uniprocessors and multicore chips. This paper examines how to extend the archtypical operation of dense linear algebra, matrix multiply, to an emerging class of uniprocessor architectures characterized by a large number of independent functional units, register banks, and cache banks connected by a 2-D on-chip network. We extend the well known algorithm for matrix multiplication by Goto to this spatially distributed class of uniprocessor and describe the optimizations of the innermost kernel, a systolic-like algorithm running on a general purpose uniprocessor. The resulting implementation yields the first demonstration of high-performance in an application executing on the TRIPS processor hardware, a next-generation distributed processor core. We show that such processors are indeed capable of substantial improvements in single threaded performance provided their spatial topography is taken into account.