The instruction set architecture (ISA) is the part of the processor that is visible to the programmer or compiler writer. A multi-level ISA is proposed to communicate data parallelism to the hardware explicitly and in a compact way, instead of extracting it dynamically with complex hardware or statically with sophisticated compiler techniques. This paper presents the co-development of a multi-level ISA and hardware for an efficient matrix processor called Mat-Core. Mat-Core extends a general-purpose scalar processor with a matrix unit for processing vector/matrix data. To hide memory latency, the matrix unit is decoupled into two components, address generation and data computation, which communicate through data queues. Like vector architectures, the data computation unit is organized in parallel lanes. However, on these parallel lanes Mat-Core can execute scalar-matrix, vector-matrix, and matrix-matrix instructions in addition to scalar-vector and vector-vector instructions. Mat-Core leads to a compiler model that is efficient in terms of both performance and executable code size. On a four-lane Mat-Core with 8×4 (32-element) matrix registers, our results show performance of about 1.6, 2.1, 4.1, and 6.4 FLOPs per clock cycle on scalar-vector multiplication, SAXPY, vector-matrix multiplication, and matrix-matrix multiplication, respectively.
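To make the lane organization concrete, the following is a minimal illustrative sketch, not Mat-Core's actual microarchitecture: it stripes the columns of an 8×4 matrix register across four parallel lanes (the way vector architectures stripe vector elements across lanes) and executes a scalar-matrix multiply by having each lane scale its own column stripe in lockstep. The function names and the column-wise striping choice are assumptions for illustration only.

```python
# Illustrative lane model (assumed layout, not the published Mat-Core design):
# an 8x4 matrix register is striped column-wise over four lanes, and a
# scalar-matrix instruction runs the same operation on every lane's stripe.

NUM_LANES = 4   # four parallel lanes, as in the paper's evaluation
ROWS = 8        # 8x4 matrix register = 32 elements

def stripe(matrix):
    """Distribute a matrix column-wise: lane i owns column i."""
    return [[row[lane] for row in matrix] for lane in range(NUM_LANES)]

def scalar_matrix_mul(scalar, matrix):
    """Scalar-matrix multiply: each lane scales its own column stripe."""
    lanes = stripe(matrix)
    scaled = [[scalar * x for x in col] for col in lanes]
    # Reassemble the 8x4 result from the per-lane stripes.
    return [[scaled[lane][r] for lane in range(NUM_LANES)]
            for r in range(ROWS)]

# Example: A[r][c] = 4*r + c, so doubling it is easy to check by eye.
A = [[NUM_LANES * r + c for c in range(NUM_LANES)] for r in range(ROWS)]
B = scalar_matrix_mul(2, A)
```

Because every lane performs identical work on its stripe, one scalar-matrix instruction keeps all four lanes busy for `ROWS` cycles, which is the source of the multiple-FLOPs-per-cycle figures reported above.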