Hardware support for large atomic units in dynamically scheduled machines

Authors:
S. W. Melvin;M. C. Shebanow;Y. N. Patt
Affiliations:
Computer Science Division, University of California, Berkeley, Berkeley, CA;Computer Science Division, University of California, Berkeley, Berkeley, CA;Computer Science Division, University of California, Berkeley, Berkeley, CA
Venue:
MICRO 21 Proceedings of the 21st annual workshop on Microprogramming and microarchitecture
Year:
1988

Citing 3
Cited 29

HPSm, a high performance restricted data flow architecture having minimal functionality

ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
Run-time generation of HPS microinstructions from a VAX instruction stream

MICRO 19 Proceedings of the 19th annual workshop on Microprogramming
Checkpoint repair for high-performance out-of-order execution machines

IEEE Transactions on Computers

High-bandwidth data memory systems for superscalar processors

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Exploiting fine-grained parallelism through a combination of hardware and software techniques

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
A fill-unit approach to multiple instruction issue

MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Dynamic rescheduling: a technique for object code compatibility in VLIW architectures

Proceedings of the 28th annual international symposium on Microarchitecture
Improving CISC instruction decoding performance using a fill unit

Proceedings of the 28th annual international symposium on Microarchitecture
Importance of profiling and compatibility

ACM Computing Surveys (CSUR) - Special issue: position statements on strategic directions in computing research
A persistent rescheduled-page cache for low overhead object code compatibility in VLIW architectures

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Integrating a misprediction recovery cache (MRC) into a superscalar pipeline

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Trace cache: a low latency approach to high bandwidth instruction fetching

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Instruction fetch mechanisms for VLIW architectures with compressed encodings

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Improving superscalar instruction dispatch and issue by exploiting dynamic code sequences

Proceedings of the 24th annual international symposium on Computer architecture
Exploiting instruction level parallelism in processors by caching scheduled groups

Proceedings of the 24th annual international symposium on Computer architecture
DAISY: dynamic compilation for 100% architectural compatibility

Proceedings of the 24th annual international symposium on Computer architecture
Alternative fetch and issue policies for the trace cache fetch mechanism

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
On high-bandwidth data cache design for multi-issue processors

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Trace processors

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
The effect of instruction fetch bandwidth on value prediction

Proceedings of the 25th annual international symposium on Computer architecture
Improving trace cache effectiveness with branch promotion and trace packing

Proceedings of the 25th annual international symposium on Computer architecture
A Trace Cache Microarchitecture and Evaluation

IEEE Transactions on Computers - Special issue on cache memory and related problems
MPS: Miss-Path Scheduling for Multiple-Issue Processors

IEEE Transactions on Computers
Performance benefits of large execution atomic units in dynamically scheduled machines

ICS '89 Proceedings of the 3rd international conference on Supercomputing
Aggressive Dynamic Execution of Decoded Traces

Journal of VLSI Signal Processing Systems - Special issue on the 1997 IEEE workshop on signal processing systems (SiPS): design and implementation
Properties of Rescheduling Size Invariance for Dynamic Rescheduling-Based VLIW Cross-Generation Compatibility

IEEE Transactions on Computers
Boosting trace cache performance with nonhead miss speculation

ICS '02 Proceedings of the 16th international conference on Supercomputing
Guest Editor's Introduction Real Machines: Design Choices/Engineering Trade-Offs

Computer
Trace Processors: Moving to Fourth-Generation Microarchitectures

Computer
Exploiting compiler-generated schedules for energy savings in high-performance processors

Proceedings of the 2003 international symposium on Low power electronics and design
Aggressive Dynamic Execution of Multimedia Kernel Traces

IPPS '98 Proceedings of the 12th. International Parallel Processing Symposium on International Parallel Processing Symposium
Branch predictor guided instruction decoding

Proceedings of the 15th international conference on Parallel architectures and compilation techniques

Quantified Score

Hi-index	0.01

Visualization

Abstract

Microarchitectures that implement conventional instruction set architectures are usually limited in that they are only able to execute a small number of microoperations concurrently. This limitation is due in part to the fact that the units of work that the hardware treats as indivisible are small. While this limitation is not important for microarchitectures with a low level of functionality, it can be significant if the goal is to build hardware that can support a large number of microoperations executing concurrently. In this paper we address the tradeoffs associated with the sizes of the various units of work that a processor considers indivisible, or atomic. We argue that by allowing larger units of work to be atomic, restrictions on concurrent operation are reduced and performance is increased. We outline the implementation of a front end for a dynamically scheduled processor with hardware support for large atomic units. We discuss tradeoffs in the design and show that with a modest investment in hardware, the run-time advantages of large atomic units can be realized without the need to alter the instruction set architecture.