This paper proposes a new processor architecture to exploit the increasing number of transistors per integrated circuit and improve the performance of many applications on general-purpose processors. The proposed processor (called Mat-Core) uses a multi-level ISA to communicate data parallelism to the processor explicitly and compactly, instead of extracting it dynamically with complex hardware or statically with sophisticated compiler techniques. Scalar-scalar (level-0), scalar-vector (level-1), vector-vector (level-1), vector-matrix (level-2), and matrix-matrix (level-3) instruction sets form a multi-level interface between hardware and software. Mat-Core extends a general-purpose scalar processor (for executing scalar instructions) with a matrix unit (for executing vector/matrix instructions). To tolerate memory latency, the matrix unit is decoupled into two components: address generation and data computation. The data computation unit is organized as parallel lanes; each lane contains a pipeline of each functional unit and a slice of the matrix register file. On these parallel lanes, Mat-Core can effectively process not only vector but also matrix data. This paper explains the execution of vector/matrix instructions on the parallel lanes of Mat-Core. Moreover, the performance of element-wise vector-vector addition, vector-matrix multiplication, and matrix-matrix multiplication is estimated on the decoupled Mat-Core processor. The growing transistor budget can also be exploited to scale Mat-Core by providing more cores in a single physical package. On such a Multi-Mat-Core processor, performance would be improved further by processing threads of code in parallel using multi-threading techniques.
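As a rough illustration of the lane organization described above, the sketch below simulates element-wise vector addition striped across parallel lanes, where each lane owns a slice of the register file and processes every n-th element. The lane count and round-robin element assignment are assumptions for illustration only, not details taken from the paper.

```python
# Minimal sketch of lane-striped vector execution (assumed round-robin
# element-to-lane mapping; the actual Mat-Core mapping may differ).
NUM_LANES = 4  # assumed lane count, not specified in the abstract

def lane_striped_add(a, b, num_lanes=NUM_LANES):
    """Element-wise vector addition where element i is handled by
    lane (i % num_lanes), mimicking a slice of the matrix register
    file per lane. All lanes would operate concurrently in hardware;
    here they are simulated sequentially."""
    assert len(a) == len(b)
    result = [0] * len(a)
    for lane in range(num_lanes):
        # Each lane walks its own stride-num_lanes slice of the vectors.
        for i in range(lane, len(a), num_lanes):
            result[i] = a[i] + b[i]
    return result

print(lane_striped_add([1, 2, 3, 4, 5, 6, 7, 8],
                       [10, 20, 30, 40, 50, 60, 70, 80]))
```

With more lanes, each slice shortens proportionally, which is the source of the speedup the lane-parallel organization aims for on vector and matrix workloads.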