This paper proposes a new processor architecture to exploit the increasing number of transistors per integrated circuit and improve the performance of many applications on general-purpose processors. The proposed processor (called Mat-Core) uses a multi-level ISA to communicate data parallelism to the processor explicitly and compactly, instead of extracting it dynamically with complex hardware or statically with sophisticated compiler techniques. Scalar-scalar (level-0), scalar-vector (level-1), vector-vector (level-1), vector-matrix (level-2), and matrix-matrix (level-3) instruction sets form a multi-level interface between hardware and software. Mat-Core extends a general-purpose scalar processor (for executing scalar instructions) with a matrix unit (for executing vector/matrix instructions). To tolerate memory latency, the matrix unit is decoupled into two components: address generation and data computation. The data computation unit is organized as parallel lanes; each lane contains a pipeline of each functional unit and a slice of the matrix register file. On these parallel lanes, Mat-Core can effectively process not only vector but also matrix data. This paper explains the execution of vector/matrix instructions on the parallel lanes of Mat-Core. Moreover, the performance of element-wise vector-vector addition, vector-matrix multiplication, and matrix-matrix multiplication is estimated on the decoupled Mat-Core processor. The growing transistor budget can also be exploited to scale Mat-Core by providing more cores in a single physical package. On such a Multi-Mat-Core processor, performance would be improved further by processing threads of code in parallel using multi-threading techniques.
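As a rough illustration of the lane organization described above, the sketch below simulates element-wise vector addition striped across parallel lanes, where each lane owns a slice of the register file and processes every n-th element. The lane count and round-robin element assignment are assumptions for illustration only, not details taken from the paper.

```python
# Minimal sketch of lane-striped vector execution (assumed round-robin
# element-to-lane mapping; the actual Mat-Core mapping may differ).
NUM_LANES = 4  # assumed lane count, not specified in the abstract

def lane_striped_add(a, b, num_lanes=NUM_LANES):
    """Element-wise vector addition where element i is handled by
    lane (i % num_lanes), mimicking a slice of the matrix register
    file per lane. All lanes would operate concurrently in hardware;
    here they are simulated sequentially."""
    assert len(a) == len(b)
    result = [0] * len(a)
    for lane in range(num_lanes):
        # Each lane walks its own stride-num_lanes slice of the vectors.
        for i in range(lane, len(a), num_lanes):
            result[i] = a[i] + b[i]
    return result

print(lane_striped_add([1, 2, 3, 4, 5, 6, 7, 8],
                       [10, 20, 30, 40, 50, 60, 70, 80]))
```

With more lanes, each slice shortens proportionally, which is the source of the speedup the lane-parallel organization aims for on vector and matrix workloads.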