The Complex Streamed Instruction (CSI) set is an instruction set extension targeted at multimedia applications. CSI instructions process two-dimensional data streams of arbitrary length that are stored in memory. Sectioning (splitting an arbitrary-length stream into fixed-size sections that fit in a vector register), data alignment, and conversion between different packed data types are all performed in hardware. Previous work has shown that CSI provides significant speedups over current media ISA extensions such as MMX and VIS. This paper presents a detailed design of a unit that executes CSI instructions, under the assumption that it is interfaced with the first-level data cache. In particular, it shows that the complex two-dimensional address-generation calculations can be implemented as a three-stage pipeline with acceptable delay and hardware cost.
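
As a rough software illustration of the two ideas named above, two-dimensional address generation and sectioning, the following Python sketch may help. It is not the hardware design presented in the paper. The parameter names (`base`, `row_len`, `num_rows`, `elem_stride`, `row_stride`, `section_size`) are hypothetical, and the section size is counted in elements here for simplicity, whereas real hardware would size sections by the vector register width.

```python
from typing import Iterator, List


def stream_addresses(base: int, row_len: int, num_rows: int,
                     elem_stride: int, row_stride: int) -> Iterator[int]:
    """Yield the byte addresses of a 2D stream in row-major order.

    All parameters are illustrative assumptions, not CSI register names.
    """
    for r in range(num_rows):
        row_start = base + r * row_stride      # vertical step between rows
        for c in range(row_len):
            yield row_start + c * elem_stride  # horizontal step within a row


def sections(addresses: Iterator[int], section_size: int) -> Iterator[List[int]]:
    """Split an arbitrary-length address stream into fixed-size sections."""
    section: List[int] = []
    for addr in addresses:
        section.append(addr)
        if len(section) == section_size:
            yield section
            section = []
    if section:
        yield section  # final section may be only partially filled


if __name__ == "__main__":
    # A 2-row by 3-element stream with rows 16 bytes apart, cut into sections of 4.
    addrs = stream_addresses(base=0x1000, row_len=3, num_rows=2,
                             elem_stride=1, row_stride=16)
    for sec in sections(addrs, section_size=4):
        print([hex(a) for a in sec])
```

In this sketch a section can span a row boundary, so the programmer never sees the section size; that mirrors the abstract's point that sectioning is handled below the instruction level rather than in software loops.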