As increasing clock frequency approaches its physical limits, a promising way to enhance performance is to increase parallelism by integrating more cores as coprocessors alongside general-purpose processors to handle the varied workloads of scientific and signal-processing applications. Many kernels in these applications map naturally onto data-parallel architectures such as array processors. The Basic Linear Algebra Subprograms (BLAS) are the standard operations for solving linear algebra problems efficiently on high-performance and parallel systems. In this paper, we implement and evaluate the performance of some important BLAS operations on a matrix coprocessor. Our analytical model shows that the performance of Level-3 BLAS, represented by the n×n matrix multiply-add operation, approaches the theoretical peak as n increases because the degree of data reuse is high. In contrast, the performance of Level-1 and Level-2 BLAS operations is low as a result of low data reuse. Fortunately, many applications rely intensively on Level-3 BLAS with only a small percentage of Level-1 and Level-2 operations.
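The data-reuse argument above can be made concrete with a back-of-the-envelope sketch (not from the paper; the operation and traffic counts below are the standard ones for AXPY, GEMV, and GEMM, counting a multiply-add as two floating-point operations and assuming each operand array is moved between memory and the processor exactly once). The ratio of flops to words moved is constant for Level-1 and Level-2 operations but grows linearly with n for the Level-3 multiply-add, which is why only Level-3 performance approaches the peak:

```python
# Arithmetic intensity (flops per word of memory traffic) for one
# representative operation at each BLAS level, problem size n.

def axpy_intensity(n):
    # Level-1: y <- a*x + y
    flops = 2 * n            # n multiplies + n adds
    words = 3 * n            # read x, read y, write y
    return flops / words     # constant 2/3, independent of n

def gemv_intensity(n):
    # Level-2: y <- A*x + y
    flops = 2 * n * n
    words = n * n + 3 * n    # read A, x, y; write y
    return flops / words     # approaches the constant 2 as n grows

def gemm_intensity(n):
    # Level-3: C <- A*B + C (the n x n matrix multiply-add)
    flops = 2 * n ** 3
    words = 4 * n * n        # read A, B, C; write C
    return flops / words     # n/2: reuse grows linearly with n

for n in (64, 256, 1024):
    print(f"n={n:5d}  axpy={axpy_intensity(n):.3f}  "
          f"gemv={gemv_intensity(n):.3f}  gemm={gemm_intensity(n):.1f}")
```

The sketch ignores caches and on-chip buffering, which only strengthen the Level-3 case: a blocked GEMM can reuse each loaded block many times, while AXPY and GEMV touch each element essentially once regardless of blocking.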