Optimizing total power of many-core processors considering voltage scaling limit and process variations

Authors:
Jungseob Lee; Nam Sung Kim
Affiliations:
University of Wisconsin-Madison, Madison, WI, USA; University of Wisconsin-Madison, Madison, WI, USA
Venue:
Proceedings of the 14th ACM/IEEE international symposium on Low power electronics and design
Year:
2009

Abstract

Processor manufacturers have recently integrated more than a hundred cores on a single die to deliver extremely high throughput for highly parallel, data-intensive applications such as physics simulation and 3D graphics. For these applications, excessive power consumption, rather than silicon area, will limit the performance of many-core processors. In this paper, to optimize the total power of many-core processors, we analyze the impact of 1) the number of cores, 2) the parallelism in applications, and 3) the supply voltage scaling limit imposed by on-die memory failures at low supply voltage. Our analysis shows that doubling the number of cores while operating below the nominal supply voltage offers the most cost-effective power reduction, consuming up to 65% less power for highly parallel applications even when supply voltage scaling is limited to 0.7 V. The saved power can, in turn, be used to improve throughput at a higher voltage in power-constrained many-core processors. Furthermore, we extend our analysis to consider within-die core-to-core frequency and leakage variations. When only a subset of the cores in a many-core processor is chosen to achieve a demanded throughput, moderately fast and leaky cores always provide the optimal power consumption. In addition, frequency-island clocking, which allows an independent frequency for each core, consumes 7% less power than global clocking, and it prefers the fastest of the chosen cores to process the fully sequential portion of the workload.
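
The trade-off between core count and supply voltage can be illustrated with a small first-order sketch. This is not the authors' model: it assumes dynamic power proportional to N·V²·f, an alpha-power-law relation between voltage and frequency, an Amdahl-style workload with parallel fraction p, all cores switching for the whole run, and no leakage. The threshold voltage (0.3 V), the exponent (1.3), the 1.0 V nominal supply, and all function names are hypothetical choices for illustration; only the 0.7 V scaling limit comes from the abstract.

```python
def freq_at(v, vth=0.3, alpha=1.3):
    """Alpha-power-law frequency (arbitrary units) at supply voltage v.
    vth and alpha are illustrative assumptions, not values from the paper."""
    return (v - vth) ** alpha / v


def voltage_for(rel_freq, v_nom=1.0, v_min=0.7, vth=0.3, alpha=1.3):
    """Lowest voltage in [v_min, v_nom] that sustains rel_freq * f(v_nom).
    Assumes rel_freq <= 1; the result is clamped at the scaling limit v_min."""
    target = rel_freq * freq_at(v_nom, vth, alpha)
    if freq_at(v_min, vth, alpha) >= target:
        return v_min  # the voltage scaling limit is the binding constraint
    lo, hi = v_min, v_nom
    for _ in range(60):  # bisection; freq_at is increasing on this interval
        mid = (lo + hi) / 2
        if freq_at(mid, vth, alpha) >= target:
            hi = mid
        else:
            lo = mid
    return hi


def relative_power(n, n0, p, v_nom=1.0, v_min=0.7):
    """Dynamic power of n cores relative to n0 cores at nominal voltage,
    for equal execution time on a workload with parallel fraction p."""
    s = 1.0 - p
    # Frequency needed so that s/f + p/(n*f) matches the baseline run time.
    rel_freq = (s + p / n) / (s + p / n0)
    v = voltage_for(rel_freq, v_nom, v_min)
    return (n / n0) * (v / v_nom) ** 2 * rel_freq


if __name__ == "__main__":
    for p in (1.0, 0.99, 0.9):
        print(p, relative_power(n=128, n0=64, p=p))
```

Under these assumptions, a fully parallel workload on twice as many cores needs only half the frequency, so the voltage drops to the 0.7 V limit and the relative power is set by the squared voltage ratio. The numbers this sketch produces are not expected to match the paper's reported results, which also account for leakage and process variations.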