This paper analyzes the performance of vector-dominated regions of code in numerical and multimedia applications on a superscalar+vector architecture and compares it with that of an 8-way superscalar processor. Splitting a program's execution into scalar and vector regions allows us to show that (1) as expected, the vector unit is much better than the wide-issue superscalar at executing the vector-dominated regions of the code; and (2) on the scalar regions, the 8-way superscalar, although better than a 4-way superscalar, is clearly not worth the extra complexity in terms of additional transistors and potential cycle-time limitations. Overall, the vector-enhanced superscalar performs from 6% to 303% better than an 8-way superscalar. We also present detailed data on the performance of the memory system, which is usually the key limiting factor when running numerical and multimedia applications. Finally, we evaluate two additional cache designs that try to alleviate the problems created by non-unit-stride memory references.
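To illustrate why non-unit-stride references can stress a cache, the following minimal sketch counts how many distinct cache lines a strided vector load touches and what fraction of the fetched bytes is actually used. It is not taken from the paper: the function names, the 8-byte element size, and the 32-byte line size are assumptions chosen only for illustration, and the model ignores associativity, banking, and replacement.

```python
def cache_lines_touched(base_addr, stride_elems, vector_length,
                        elem_bytes=8, line_bytes=32):
    """Count the distinct cache lines touched by one strided vector load.

    elem_bytes and line_bytes are illustrative assumptions, not values
    reported in the paper.
    """
    lines = set()
    for i in range(vector_length):
        # Byte address of the i-th element of the vector
        addr = base_addr + i * stride_elems * elem_bytes
        # Index of the cache line that contains this address
        lines.add(addr // line_bytes)
    return len(lines)


def line_utilization(base_addr, stride_elems, vector_length,
                     elem_bytes=8, line_bytes=32):
    """Fraction of the fetched bytes that the vector load actually uses."""
    useful = vector_length * elem_bytes
    fetched = cache_lines_touched(base_addr, stride_elems, vector_length,
                                  elem_bytes, line_bytes) * line_bytes
    return useful / fetched


if __name__ == "__main__":
    for stride in (1, 2, 4):
        touched = cache_lines_touched(0, stride, 16)
        used = line_utilization(0, stride, 16)
        print(f"stride={stride}: lines={touched}, utilization={used:.2f}")
```

Under these assumed parameters, a unit-stride load of 16 elements uses every byte of the lines it brings in, while larger strides spread the same 16 elements over more lines, so a growing share of each fetched line goes unused.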