The Back Propagation (BP) training algorithm has received intensive research effort aimed at exploiting its parallelism to reduce training time on complex problems. A modified version of BP based on matrix-matrix multiplication was proposed for parallel processing. This paper discusses the implementation of Matrix Back Propagation (MBP) using scalar, vector, and matrix instruction set architectures (ISAs), and shows that MBP performance improves when switching from scalar to vector ISA and from vector to matrix ISA. On a practical application, speech recognition, training a neural network with unrolled scalar code achieves a speedup of 1.83 over plain scalar ISA. On eight parallel lanes, the speedups of vector, unrolled vector, and matrix ISA are 10.33, 11.88, and 15.36, respectively, against a theoretical maximum of 16. Our results show that the matrix ISA performs close to this optimum because it reuses loaded data, reduces loop overhead, and overlaps memory operations with arithmetic operations.
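As a rough illustration of why loop unrolling helps the scalar case, the C sketch below contrasts a plain matrix-matrix multiply kernel (the operation at the core of MBP) with a version whose reduction loop is unrolled by four. This is a hypothetical sketch, not the authors' implementation: the function names, the row-major layout, and the unroll factor are all assumptions made for the example.

```c
/* Minimal sketch (not the paper's code): the matrix-matrix multiply
 * kernel at the heart of MBP, in plain scalar form and with the inner
 * loop unrolled by four. */
#include <stddef.h>

/* C = A * B, with A of size n x k and B of size k x m, row-major. */
void mbp_gemm_scalar(size_t n, size_t k, size_t m,
                     const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++) {
            float sum = 0.0f;
            for (size_t p = 0; p < k; p++)
                sum += A[i * k + p] * B[p * m + j];
            C[i * m + j] = sum;
        }
}

/* Same kernel with the reduction loop unrolled by four (assumes
 * k % 4 == 0 for brevity). Four independent partial sums break the
 * serial dependence chain and amortize the branch and index updates
 * over four multiply-adds, which is the kind of loop-overhead
 * reduction behind the reported 1.83x "unrolled scalar" speedup. */
void mbp_gemm_unrolled(size_t n, size_t k, size_t m,
                       const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++) {
            float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
            for (size_t p = 0; p < k; p += 4) {
                s0 += A[i * k + p]     * B[p       * m + j];
                s1 += A[i * k + p + 1] * B[(p + 1) * m + j];
                s2 += A[i * k + p + 2] * B[(p + 2) * m + j];
                s3 += A[i * k + p + 3] * B[(p + 3) * m + j];
            }
            C[i * m + j] = (s0 + s1) + (s2 + s3);
        }
}
```

A vector or matrix ISA would replace the inner loop with vector loads and multiply-accumulate instructions, further cutting instruction count and letting memory accesses overlap arithmetic, consistent with the abstract's explanation of why the matrix ISA approaches the theoretical speedup.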