Decoupled access/execute computer architectures
ACM Transactions on Computer Systems (TOCS)
Decoupled vector architectures
HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
Heterogeneous Chip Multiprocessors
Computer
Design of a Computer: The Control Data 6600
Design and evaluation of a hierarchical decoupled architecture
The Journal of Supercomputing
Transaction-Level Modeling with SystemC: TLM Concepts and Applications for Embedded Systems
Parallel operation in the Control Data 6600
AFIPS '64 (Fall, part II) Proceedings of the October 27-29, 1964, fall joint computer conference, part II: very high speed computer systems
Validity of the single processor approach to achieving large scale computing capabilities
AFIPS '67 (Spring) Proceedings of the April 18-20, 1967, spring joint computer conference
SystemC: From the Ground Up, Second Edition
Simplified vector-thread architectures for flexible and efficient data-parallel accelerators
Computer Architecture: A Quantitative Approach, Fifth Edition
Computer Organization and Design: The Hardware/Software Interface, Revised Fourth Edition
Towards efficient GPU sharing on multicore processors
Proceedings of the second international workshop on Performance modeling, benchmarking and simulation of high performance computing systems
A survey on hardware-aware and heterogeneous computing on multicore processors and accelerators
Concurrency and Computation: Practice & Experience
Mat-core: a decoupled matrix core extension for general-purpose processors
Neural, Parallel & Scientific Computations
This paper proposes extending a multi-core processor with a shared matrix unit to maximize on-chip resource utilization and to leverage the current multi-core trend to improve the performance of data-parallel applications. Each core fetches scalar/vector/matrix instructions from its instruction cache. Scalar instructions continue executing on the scalar datapath; vector/matrix instructions are issued by the decode stage to the shared matrix unit through the corresponding FIFO queue. Scalar results of reduction vector/matrix instructions are sent back from the matrix unit to the scalar core that issued them. The performance evaluation uses several dense linear algebra kernels (scalar-vector multiplication, scalar times vector plus vector, Givens rotation, rank-1 update, vector-matrix multiplication, and matrix-matrix multiplication) as well as the discrete cosine transform, sum of absolute differences, and affine transformation. Our results show that sharing the matrix unit between two cores improves its utilization by 9% to 26% compared with extending a single core with a matrix unit. Moreover, the average speedup of the dual-core shared matrix unit over a single core extended with a matrix unit ranges from 6% to 24%, and the maximum speedup ranges from 13% to 46%.
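The dispatch mechanism described in the abstract — each core's decode stage pushing vector/matrix instructions into its own FIFO queue, with a shared matrix unit draining the queues and returning scalar reduction results to the issuing core — can be sketched in software. This is a minimal illustrative model, not code from the paper; the `dot` opcode and the round-robin arbitration policy are assumptions made for the example.

```python
from collections import deque

class SharedMatrixUnit:
    """Toy model of a matrix unit shared by several scalar cores.

    Each core owns one FIFO queue (filled by its decode stage) and one
    result queue (reduction results sent back to the scalar core).
    """

    def __init__(self, num_cores):
        self.queues = [deque() for _ in range(num_cores)]   # per-core instruction FIFOs
        self.results = [deque() for _ in range(num_cores)]  # per-core reduction results

    def issue(self, core_id, op, operands):
        """A core's decode stage issues a vector/matrix instruction."""
        self.queues[core_id].append((op, operands))

    def step(self):
        """Service one pending instruction per core, round-robin (assumed policy)."""
        for core_id, q in enumerate(self.queues):
            if q:
                op, operands = q.popleft()
                if op == "dot":
                    # Reduction: compute a scalar and send it back to the issuing core.
                    a, b = operands
                    self.results[core_id].append(sum(x * y for x, y in zip(a, b)))

mu = SharedMatrixUnit(num_cores=2)
mu.issue(0, "dot", ([1, 2, 3], [4, 5, 6]))  # core 0 issues a reduction
mu.issue(1, "dot", ([1, 1], [2, 3]))        # core 1 issues a reduction
mu.step()
print(mu.results[0][0], mu.results[1][0])   # 32 5
```

The per-core FIFOs decouple the scalar pipelines from the shared unit: a core never stalls another core at issue time, and contention only appears when the matrix unit arbitrates among non-empty queues.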