Increased Scalability and Power Efficiency by Using Multiple Speed Pipelines

Authors:
Emil Talpes;Diana Marculescu
Affiliations:
Carnegie Mellon University;Carnegie Mellon University
Venue:
Proceedings of the 32nd annual international symposium on Computer Architecture
Year:
2005

Abstract

One of the most important problems faced by microarchitecture designers is the poor scalability of some current solutions with increased clock frequencies and wider pipelines. As several studies show, internal processor structures scale differently with decreasing device sizes: in some cases the access latency is determined by the speed of the logic circuitry, while in others it is dominated by the interconnect delay. Furthermore, while some stages can be super-pipelined with relatively small performance loss, others must be kept atomic. This paper proposes a possible solution to this problem that avoids the traditional trade-off between parallelism and clock speed. First, allowing instructions to enter and leave the Issue Window in an asynchronous manner enables faster speeds in the front-end at the expense of small synchronization latencies. Second, using an Execution Cache to store instructions that are already scheduled allows the issue circuitry to be bypassed and the execution core to be clocked at higher frequencies. Combined, these two mechanisms result in a 50% to 60% performance increase for our test microarchitecture, without requiring a completely new scheduling mechanism. Furthermore, the proposed microarchitecture requires significantly less energy than the original baseline, with a 30% reduction in a 0.13 µm process technology or 20% in a 0.06 µm one.
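
The Execution Cache idea in the abstract (store groups that are already scheduled, then reuse them so the issue logic is bypassed) can be illustrated with a toy model. This is only a sketch under stated assumptions, not the microarchitecture evaluated in the paper: the names `Instr`, `schedule`, `ExecutionCache`, `width` and `start_pc` are hypothetical, registers are assumed to be already renamed (each destination is written once), every instruction has unit latency, and clock domains and synchronization latencies are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction: one destination register, zero or more sources."""
    dest: str
    srcs: tuple = ()


def schedule(instrs, width=2):
    """Greedy scheduler standing in for the issue logic (an assumption).

    Each instruction goes into the earliest issue group that comes after
    all of its producers and still has a free slot (at most `width` slots).
    """
    groups = []
    ready_at = {}  # register name -> index of the first group that may read it
    for ins in instrs:
        g = max((ready_at.get(s, 0) for s in ins.srcs), default=0)
        while g < len(groups) and len(groups[g]) >= width:
            g += 1  # group is full, try the next one
        if g == len(groups):
            groups.append([])
        groups[g].append(ins)
        ready_at[ins.dest] = g + 1  # unit latency: result usable one group later
    return groups


class ExecutionCache:
    """Toy cache of scheduled issue groups, keyed by the trace start address."""

    def __init__(self):
        self.lines = {}
        self.hits = 0
        self.misses = 0

    def issue(self, start_pc, instrs):
        cached = self.lines.get(start_pc)
        if cached is not None:
            self.hits += 1  # hit: replay stored groups, scheduler is bypassed
            return cached
        self.misses += 1  # miss: schedule once and remember the result
        groups = schedule(instrs)
        self.lines[start_pc] = groups
        return groups


trace = [
    Instr("r1"),
    Instr("r2", ("r1",)),
    Instr("r3"),
    Instr("r4", ("r2", "r3")),
]
cache = ExecutionCache()
first = cache.issue(0x400, trace)   # miss: runs the scheduler
second = cache.issue(0x400, trace)  # hit: reuses the stored schedule
print(len(first), cache.hits, cache.misses)
```

In this toy model the second call for the same `start_pc` returns the stored groups without invoking `schedule` at all, which is the sense in which a hit "bypasses" the issue logic. The model does not capture timing, so it says nothing about the performance or energy figures reported in the abstract.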