The instruction set architecture (ISA) is the part of the processor that is visible to the programmer or compiler writer. A multi-level ISA is proposed to communicate data parallelism to the hardware explicitly and in a compact way, instead of extracting it dynamically with complex hardware or statically with sophisticated compiler techniques. This paper presents the co-development of a multi-level ISA and hardware for an efficient matrix processor called Mat-Core. Mat-Core extends a general-purpose scalar processor with a matrix unit for processing vector/matrix data. To hide memory latency, the matrix unit is decoupled into two components, address generation and data computation, which communicate through data queues. Like vector architectures, the data computation unit is organized in parallel lanes. However, on these parallel lanes Mat-Core can execute scalar-matrix, vector-matrix, and matrix-matrix instructions in addition to scalar-vector and vector-vector instructions. Mat-Core leads to a compiler model that is efficient in terms of both performance and executable code size. On a four-lane Mat-Core with 8×4 (32-element) matrix registers, our results show performance of about 1.6, 2.1, 4.1, and 6.4 FLOPs per clock cycle on scalar-vector multiplication, SAXPY, vector-matrix multiplication, and matrix-matrix multiplication, respectively.
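To make the lane organization concrete, the following is a minimal illustrative sketch, not Mat-Core's actual microarchitecture: it stripes the columns of an 8×4 matrix register across four parallel lanes (the way vector architectures stripe vector elements across lanes) and executes a scalar-matrix multiply by having each lane scale its own column stripe in lockstep. The function names and the column-wise striping choice are assumptions for illustration only.

```python
# Illustrative lane model (assumed layout, not the published Mat-Core design):
# an 8x4 matrix register is striped column-wise over four lanes, and a
# scalar-matrix instruction runs the same operation on every lane's stripe.

NUM_LANES = 4   # four parallel lanes, as in the paper's evaluation
ROWS = 8        # 8x4 matrix register = 32 elements

def stripe(matrix):
    """Distribute a matrix column-wise: lane i owns column i."""
    return [[row[lane] for row in matrix] for lane in range(NUM_LANES)]

def scalar_matrix_mul(scalar, matrix):
    """Scalar-matrix multiply: each lane scales its own column stripe."""
    lanes = stripe(matrix)
    scaled = [[scalar * x for x in col] for col in lanes]
    # Reassemble the 8x4 result from the per-lane stripes.
    return [[scaled[lane][r] for lane in range(NUM_LANES)]
            for r in range(ROWS)]

# Example: A[r][c] = 4*r + c, so doubling it is easy to check by eye.
A = [[NUM_LANES * r + c for c in range(NUM_LANES)] for r in range(ROWS)]
B = scalar_matrix_mul(2, A)
```

Because every lane performs identical work on its stripe, one scalar-matrix instruction keeps all four lanes busy for `ROWS` cycles, which is the source of the multiple-FLOPs-per-cycle figures reported above.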