The Back Propagation (BP) training algorithm has received intensive research effort aimed at exploiting its parallelism to reduce training time on complex problems. A modified version of BP based on matrix-matrix multiplication was proposed for parallel processing. This paper discusses the implementation of Matrix Back Propagation (MBP) using scalar, vector, and matrix instruction set architectures (ISAs), and shows that MBP performance improves when switching from scalar to vector ISA and from vector to matrix ISA. On a practical application, speech recognition, training a neural network with unrolled scalar code achieves a speedup of 1.83 over plain scalar ISA. On eight parallel lanes, the speedups of vector, unrolled vector, and matrix ISA are 10.33, 11.88, and 15.36, respectively, against a theoretical maximum of 16. Our results show that the matrix ISA performs close to this optimum because it reuses loaded data, reduces loop overhead, and overlaps memory operations with arithmetic operations.
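As a rough illustration of why loop unrolling helps the scalar case, the C sketch below contrasts a plain matrix-matrix multiply kernel (the operation at the core of MBP) with a version whose reduction loop is unrolled by four. This is a hypothetical sketch, not the authors' implementation: the function names, the row-major layout, and the unroll factor are all assumptions made for the example.

```c
/* Minimal sketch (not the paper's code): the matrix-matrix multiply
 * kernel at the heart of MBP, in plain scalar form and with the inner
 * loop unrolled by four. */
#include <stddef.h>

/* C = A * B, with A of size n x k and B of size k x m, row-major. */
void mbp_gemm_scalar(size_t n, size_t k, size_t m,
                     const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++) {
            float sum = 0.0f;
            for (size_t p = 0; p < k; p++)
                sum += A[i * k + p] * B[p * m + j];
            C[i * m + j] = sum;
        }
}

/* Same kernel with the reduction loop unrolled by four (assumes
 * k % 4 == 0 for brevity). Four independent partial sums break the
 * serial dependence chain and amortize the branch and index updates
 * over four multiply-adds, which is the kind of loop-overhead
 * reduction behind the reported 1.83x "unrolled scalar" speedup. */
void mbp_gemm_unrolled(size_t n, size_t k, size_t m,
                       const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++) {
            float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
            for (size_t p = 0; p < k; p += 4) {
                s0 += A[i * k + p]     * B[p       * m + j];
                s1 += A[i * k + p + 1] * B[(p + 1) * m + j];
                s2 += A[i * k + p + 2] * B[(p + 2) * m + j];
                s3 += A[i * k + p + 3] * B[(p + 3) * m + j];
            }
            C[i * m + j] = (s0 + s1) + (s2 + s3);
        }
}
```

A vector or matrix ISA would replace the inner loop with vector loads and multiply-accumulate instructions, further cutting instruction count and letting memory accesses overlap arithmetic, consistent with the abstract's explanation of why the matrix ISA approaches the theoretical speedup.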