Aggressive Dynamic Execution of Multimedia Kernel Traces

Authors:
B. Bishop
Venue:
IPPS '98 Proceedings of the 12th International Parallel Processing Symposium
Year:
1998

Citing 10
Cited 0

HPSm, a high performance restricted data flow architecture having minimal functionality
ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture

HPS, a new microarchitecture: rationale and introduction
MICRO 18 Proceedings of the 18th annual workshop on Microprogramming

Hardware support for large atomic units in dynamically scheduled machines
MICRO 21 Proceedings of the 21st annual workshop on Microprogramming and microarchitecture

Improving CISC instruction decoding performance using a fill unit
Proceedings of the 28th annual international symposium on Microarchitecture

Intel MMX for multimedia PCs
Communications of the ACM

Trace cache: a low latency approach to high bandwidth instruction fetching
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture

Stage-skip pipeline: a low power processor architecture using a decoded instruction buffer
ISLPED '96 Proceedings of the 1996 international symposium on Low power electronics and design

Improving superscalar instruction dispatch and issue by exploiting dynamic code sequences
Proceedings of the 24th annual international symposium on Computer architecture

Exploiting instruction level parallelism in processors by caching scheduled groups
Proceedings of the 24th annual international symposium on Computer architecture

Algorithm 419: zeros of a complex polynomial [C2]
Communications of the ACM

Abstract

There has been relatively little analytical work on processor optimizations for multimedia applications. With Intel's introduction of MMX, it is clear that this is an area of increasing importance. Building on previous work [4, 5, 6, 7, 13, 14], we propose optimizations for multimedia architectures that support independent parallel execution of instructions within dynamically assembled traces, resulting in dramatic performance improvements. Specifically, we propose instruction scheduling and register renaming algorithms that are simplified by constraints on trace formation. In addition, we suggest specific instruction pool and trace cache parameters. We constructed a simulator to measure the benefits of these processor optimizations for multimedia applications. The simulated machine, which could fetch and decode 2 instructions per cycle, performed better than a superscalar machine that could fetch and decode 8 instructions per cycle. Execution rates as high as 7.3 instructions per cycle were achieved for the simulated benchmarks, assuming 16 instructions per trace.