An extended set of FORTRAN basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
Data and computation transformations for multiprocessors
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Matrix computations (3rd ed.)
LAPACK Users' guide (third ed.)
Computing the Singular-Value Decomposition on the ILLIAC IV
ACM Transactions on Mathematical Software (TOMS)
Communications of the ACM - Special issue on computer architecture
Trident: a scalable architecture for scalar, vector, and matrix operations
CRPIT '02 Proceedings of the seventh Asia-Pacific conference on Computer systems architecture
Computer architecture: a quantitative approach
Solving Linear Systems on Vector and Shared Memory Computers
A Simulation Study of Decoupled Vector Architectures
The Journal of Supercomputing
Very Long Instruction Word architectures and the ELI-512
ISCA '83 Proceedings of the 10th annual international symposium on Computer architecture
Trident: Technology-Scalable Architecture for Data Parallel Applications
IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
The Design of a Parallel Dense Linear Algebra Software Library: Reduction to Hessenberg, Tridiagonal, and Bidiagonal Form
Vector microprocessors
Scalable vector media-processors for embedded systems
Neural, Parallel & Scientific Computations
A highly efficient implementation of a backpropagation learning algorithm using matrix ISA
Journal of Parallel and Distributed Computing
This paper discusses the parallel implementation and evaluation of the reduction of a dense matrix to bidiagonal form on the Trident processor. The standard Golub and Kahan Householder bidiagonalization algorithm, which is rich in matrix-vector operations, and the LAPACK subroutine _GEBRD, which uses a mixture of vector, matrix-vector, and matrix operations, are simulated on the Trident processor. We show how to use the Trident parallel execution units, ring, and communication registers to effectively perform the vector, matrix-vector, and matrix operations needed to bidiagonalize a matrix. The number of clock cycles per FLOP is used as the metric for evaluating the performance of the Trident processor. Our results show that high efficiency is attained by using matrix-vector and matrix operations as much as possible, because they reduce the ratio of memory accesses to FLOPs. On a 32K×32K matrix, applying matrix-vector operations to the standard Golub and Kahan algorithm on 128 Trident lanes yields a speedup of around 190 times (superlinear) over using only vector operations on one lane, and around two times over vector operations on 128 lanes. Using matrix operations in the _GEBRD subroutine yields a speedup of around 307 times (superlinear) over vector operations on one lane, 3.2 times over vector operations on 128 lanes, and 1.3 times over matrix-vector operations.
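For reference, the Golub-Kahan procedure mentioned above reduces a matrix to upper bidiagonal form by alternating left and right Householder reflections; the updates are dominated by matrix-vector products and rank-one updates, which is exactly the operation mix the paper maps onto the Trident lanes. Below is a minimal NumPy sketch of the textbook algorithm (not the paper's Trident implementation; the function names are illustrative):

```python
import numpy as np

def householder(x):
    """Return (v, beta) so that (I - beta * v v^T) x = +/- ||x|| e1."""
    v = np.asarray(x, dtype=float).copy()
    normx = np.linalg.norm(x)
    if normx == 0.0:
        return v, 0.0
    # Choose the sign that avoids cancellation in v[0].
    v[0] += np.sign(x[0]) * normx if x[0] != 0 else normx
    beta = 2.0 / np.dot(v, v)
    return v, beta

def bidiagonalize(A):
    """Golub-Kahan bidiagonalization of an m x n matrix (m >= n).

    Returns B, upper bidiagonal, with the same singular values as A.
    The orthogonal factors U and V are not accumulated in this sketch.
    """
    B = np.asarray(A, dtype=float).copy()
    m, n = B.shape
    for k in range(n):
        # Left reflection: zero out B[k+1:, k] (matrix-vector rich update).
        v, beta = householder(B[k:, k])
        B[k:, k:] -= beta * np.outer(v, v @ B[k:, k:])
        if k < n - 2:
            # Right reflection: zero out B[k, k+2:].
            v, beta = householder(B[k, k + 1:])
            B[k:, k + 1:] -= beta * np.outer(B[k:, k + 1:] @ v, v)
    return B
```

Each left update `v @ B[k:, k:]` is a matrix-vector product followed by a rank-one correction; reorganizing these into blocked matrix operations (as _GEBRD does) is what reduces memory traffic per FLOP.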