BackPropagation (BP) is the best-known learning algorithm for Artificial Neural Networks (ANNs). Considerable research effort has gone into exploiting its parallelism to reduce training time on complex problems, and a modified version of BP based on matrix-matrix multiplication has been proposed for parallel processing. In this paper, we present implementations of Matrix BackPropagation (MBP) on scalar, vector, and matrix Instruction Set Architectures (ISAs). We show that the performance of MBP improves when switching from a scalar ISA to a vector ISA, and improves further when switching from a vector ISA to a matrix ISA. On a practical application, speech recognition, the speedup of training a neural network with the unrolled scalar implementation over the plain scalar implementation is 1.83. On eight parallel lanes, the vector, unrolled vector, and matrix ISAs achieve speedups of 10.33, 11.88, and 15.36, respectively, against a maximum theoretical speedup of 16. These results show that the matrix ISA delivers close-to-optimal performance because it reuses loaded data, reduces loop overhead, and overlaps memory operations with arithmetic operations.
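To illustrate the idea of recasting BP as matrix-matrix multiplication, the following is a minimal sketch in plain Python. It is not the paper's MBP code and does not model any of the ISAs. The single sigmoid layer, the squared-error loss, and every function and variable name are assumptions chosen for illustration. It computes the weight gradient for a batch of patterns in two ways: one pattern at a time with vector operations, and for the whole batch with two matrix products.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def per_pattern_gradient(weights, inputs, targets):
    """Accumulate the gradient one training pattern at a time.

    Assumed shapes: weights is n_in x n_out, inputs is n_patterns x n_in,
    targets is n_patterns x n_out. The loss is 0.5 * sum of squared errors.
    """
    n_in, n_out = len(weights), len(weights[0])
    grad = [[0.0] * n_out for _ in range(n_in)]
    for x, t in zip(inputs, targets):
        out = [sigmoid(sum(x[i] * weights[i][j] for i in range(n_in)))
               for j in range(n_out)]
        delta = [(o - tj) * o * (1.0 - o) for o, tj in zip(out, t)]
        # Outer product of the input vector and the delta vector.
        for i in range(n_in):
            for j in range(n_out):
                grad[i][j] += x[i] * delta[j]
    return grad


def batch_gradient(weights, inputs, targets):
    """Compute the same gradient for the whole batch with matrix products."""
    # Forward pass: one matrix-matrix product covers every pattern.
    outputs = [[sigmoid(v) for v in row] for row in matmul(inputs, weights)]
    deltas = [[(o - t) * o * (1.0 - o) for o, t in zip(orow, trow)]
              for orow, trow in zip(outputs, targets)]
    # Backward pass: inputs^T * deltas sums the per-pattern outer products.
    return matmul(transpose(inputs), deltas)


# Hypothetical toy data: 4 patterns, 3 inputs, 2 outputs.
weights = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
inputs = [[1.0, 0.0, 1.0], [0.5, -1.0, 2.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0]]
targets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

grad_loop = per_pattern_gradient(weights, inputs, targets)
grad_batch = batch_gradient(weights, inputs, targets)
```

Both functions produce the same gradient up to floating-point rounding. In the batch form, each weight and input value takes part in many multiply-add operations inside a single matrix product, which is the kind of data reuse the abstract credits for the matrix ISA result.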