Dynamically Trading Frequency for Complexity in a GALS Microprocessor

Authors:
Steven Dropsho; Greg Semeraro; David H. Albonesi; Grigorios Magklis; Michael L. Scott
Affiliations:
University of Rochester; University of Rochester; University of Rochester; University of Rochester; University of Rochester
Venue:
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Year:
2004

Abstract

Microprocessors are traditionally designed to provide "best overall" performance across a wide range of applications and operating environments. Several groups have proposed hardware techniques that save energy by "downsizing" hardware resources that are underutilized by the current application phase. Others have proposed a different energy-saving approach: dividing the processor into domains and dynamically changing the clock frequency and voltage within each domain during phases when the full domain frequency is not required. What has not been studied to date is how to exploit the adaptive nature of these approaches to improve performance rather than to save energy. In this paper, we describe an adaptive globally asynchronous, locally synchronous (GALS) microprocessor with a fixed global voltage and four independently clocked domains. Each domain is streamlined with modest hardware structures for very high clock frequency. Key structures can then be upsized on demand to exploit more distant parallelism, improve branch prediction, or increase cache capacity. Although doing so requires decreasing the associated domain frequency, other domain frequencies are unaffected. Our approach, therefore, is to maximize the throughput of each domain by finding the proper balance between the number of clock periods and the clock frequency for each application phase. To achieve this objective, we use novel hardware-based control techniques that accurately and efficiently capture the performance of all possible cache and queue configurations within a single interval, without having to resort to exhaustive online exploration or expensive offline profiling. Measuring across a broad suite of application benchmarks, we find that configuring our adaptive GALS processor just once per application yields 17.6% better performance, on average, than the "best overall" fully synchronous design. By adapting automatically to application phases, we can increase this advantage to more than 20%.
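
The balance the abstract describes, between the number of clock periods and the clock frequency, can be illustrated with a small sketch. It is not the hardware control technique from the paper. It only assumes that the time for an interval is the number of clock periods divided by the domain frequency, and that a larger structure needs fewer periods but runs at a lower clock. The `Config` type, the function names, and every number below are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One candidate size for a resizable structure in a single clock domain."""
    name: str
    freq_ghz: float  # domain clock frequency at this size (hypothetical value)
    cycles: int      # estimated clock periods for the interval (hypothetical value)


def interval_time_ns(cfg: Config) -> float:
    """Estimated interval time: clock periods / frequency (cycles / GHz = ns)."""
    return cfg.cycles / cfg.freq_ghz


def best_config(configs):
    """Return the configuration with the smallest estimated interval time."""
    return min(configs, key=interval_time_ns)


# Made-up estimates: upsizing lowers the cycle count but also the frequency.
CACHE_CONFIGS = [
    Config("small", 1.60, 1_200_000),
    Config("medium", 1.40, 1_000_000),
    Config("large", 1.20, 950_000),
]

if __name__ == "__main__":
    for cfg in CACHE_CONFIGS:
        print(f"{cfg.name:>6}: {interval_time_ns(cfg):,.0f} ns")
    print("selected:", best_config(CACHE_CONFIGS).name)
```

With these invented numbers the middle size is selected: it saves enough clock periods to justify its lower frequency, while the largest size does not. Because each domain is clocked independently, the same selection could be made per domain without affecting the others.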