Available instruction-level parallelism for superscalar and superpipelined machines. ASPLOS III: Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems.
Code optimizers and register organizations for vector architectures.
Designing the TFP Microprocessor. IEEE Micro.
Evaluation of design alternatives for a multiprocessor microprocessor. ISCA '96: Proceedings of the 23rd Annual International Symposium on Computer Architecture.
Complexity-effective superscalar processors. Proceedings of the 24th Annual International Symposium on Computer Architecture.
Initial results on the performance and cost of vector microprocessors. MICRO 30: Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture.
Computer architecture (2nd ed.): a quantitative approach.
A Chip-Multiprocessor Architecture with Speculative Multithreading. IEEE Transactions on Computers.
Exploiting a new level of DLP in multimedia applications. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.
Variability in the execution of multimedia applications and implications for architecture. ISCA '01: Proceedings of the 28th Annual International Symposium on Computer Architecture.
A Simulation Study of Decoupled Vector Architectures. The Journal of Supercomputing.
Billion-Transistor Architectures. Computer.
Subword Parallelism with MAX-2. IEEE Micro.
The visual instruction set (VIS) in UltraSPARC. COMPCON '95: Proceedings of the 40th IEEE Computer Society International Conference.
Vector microprocessors.
Matrix bidiagonalization: implementation and evaluation on the Trident processor. Neural, Parallel & Scientific Computations.
Within a few years it will be possible to integrate a billion transistors on a single chip. At that integration level, we propose using a high-level ISA to express parallelism to hardware, rather than spending a huge transistor budget on extracting it dynamically. Since the fundamental data structures of a wide variety of applications are scalars, vectors, and matrices, our proposed Trident processor extends the classical vector ISA with matrix operations. The Trident processor consists of a set of parallel vector pipelines (PVPs) combined with a fast in-order scalar core. The PVPs can access both vector and matrix register files to perform vector, matrix, and matrix-vector operations. One key point of our design is the exploitation of up to three levels of data parallelism. Another is the use of ring register files for storing vector and matrix data. The ring structure reduces the number and size of the address decoders, the number of ports, the area overhead of the address bus, and the number of registers attached to the bit lines, while also providing local communication between PVPs. Scaling the Trident processor requires no additional fetch, decode, or issue bandwidth; it requires only replicating PVPs and enlarging the register files. Scientific, engineering, multimedia, and many other applications that mix scalar, vector, and matrix operations can be sped up on the Trident processor.
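To make the abstract's partitioning idea concrete, the following is a minimal illustrative sketch, not taken from the paper: it models how a matrix-vector product might be distributed across several parallel vector pipelines (PVPs), with each pipeline handling an interleaved subset of matrix rows. The function name, the round-robin row assignment, and the pipeline count are all assumptions made for illustration; real PVPs would execute their row slices concurrently against the matrix and vector register files.

```python
def matvec_on_pvps(matrix, vector, num_pvps=4):
    """Illustrative only: split the rows of `matrix` round-robin across
    `num_pvps` pipelines; each pipeline computes the dot products for
    its own rows independently of the others."""
    rows = len(matrix)
    result = [0] * rows
    # Each "pipeline" here is just a row subset iterated sequentially;
    # on the proposed hardware these loops would run in parallel.
    for pvp in range(num_pvps):
        for r in range(pvp, rows, num_pvps):
            result[r] = sum(m * v for m, v in zip(matrix[r], vector))
    return result

A = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 1]
print(matvec_on_pvps(A, x))  # [3, 7, 11, 15]
```

Because each row's dot product is independent, no cross-pipeline communication is needed for this operation; the ring register files described in the abstract would matter for operations where neighboring PVPs must exchange operands.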