In this paper, we consider the increased performance that can be obtained by using, in concert, three previously proposed enhancements. These enhancements are aggressive dynamic (run time) instruction scheduling, the reuse of decoded instructions, and trace scheduling (both aggressive dynamic instruction scheduling and decoded instruction reuse have been used in commercial systems). We show that these three enhancements complement and support one another. Hence, while each of these enhancements has been shown to have merit in its own right, when used in concert, we claim the overall advantage is greater than that obtained by using any one singly. To support this claim, we present the results from running benchmarks representing several common multimedia kernels. Subsequent simulations show results of 7.3 instructions completed per cycle for the best-performing benchmark for a reasonably aggressive microarchitecture that combines trace scheduling of decoded instructions (i.e., decoded traces) with aggressive dynamic execution.