Best of Both Latency and Throughput

Authors:
Ed Grochowski;Ronny Ronen;John Shen;Hong Wang
Affiliations:
Intel Labs, Santa Clara, CA;Intel Israel;Intel Labs, Santa Clara, CA;Intel Labs, Santa Clara, CA
Venue:
ICCD '04 Proceedings of the IEEE International Conference on Computer Design
Year:
2004

Citing 0
Cited 34

Mitigating Amdahl's Law through EPI Throttling

Proceedings of the 32nd annual international symposium on Computer Architecture
The Impact of Performance Asymmetry in Emerging Multicore Architectures

Proceedings of the 32nd annual international symposium on Computer Architecture
Heterogeneous Chip Multiprocessors

Computer
Power-performance considerations of parallel computing on chip multiprocessors

ACM Transactions on Architecture and Code Optimization (TACO)
Multiple Instruction Stream Processor

Proceedings of the 33rd annual international symposium on Computer Architecture
Core architecture optimization for heterogeneous chip multiprocessors

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Design space exploration for multicore architectures: a power/performance/thermal view

Proceedings of the 20th annual international conference on Supercomputing
An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Enabling scalability and performance in a large scale CMP environment

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Evaluating design tradeoffs in on-chip power management for CMPs

ISLPED '07 Proceedings of the 2007 international symposium on Low power electronics and design
Thermal-aware scheduling for future chip multiprocessors

EURASIP Journal on Embedded Systems
Hiding the misprediction penalty of a resource-efficient high-performance processor

ACM Transactions on Architecture and Code Optimization (TACO)
Merge: a programming model for heterogeneous multi-core systems

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Exploring power management in multi-core systems

Proceedings of the 2008 Asia and South Pacific Design Automation Conference
Larrabee: a many-core x86 architecture for visual computing

ACM SIGGRAPH 2008 papers
On the performance benefits of sharing and privatizing second and third-level cache memories in homogeneous multi-core architectures

Microprocessors & Microsystems
Pangaea: a tightly-coupled IA32 heterogeneous chip multiprocessor

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Multitasking workload scheduling on flexible-core chip multiprocessors

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Low-complexity policies for energy-performance tradeoff in chip-multi-processors

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Multiple clock and voltage domains for chip multi processors

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Conservation cores: reducing the energy of mature computations

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
LRU-PEA: a smart replacement policy for non-uniform cache architectures on chip multiprocessors

ICCD'09 Proceedings of the 2009 IEEE international conference on Computer design
The auction: optimizing banks usage in Non-Uniform Cache Architectures

Proceedings of the 24th ACM International Conference on Supercomputing
Area-efficient floorplans and interconnects for homogeneous multi-core architectures

International Journal of High Performance Systems Architecture
Understanding throughput-oriented architectures

Communications of the ACM
Criticality-driven superscalar design space exploration

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Federation: Boosting per-thread performance of throughput-oriented manycore architectures

ACM Transactions on Architecture and Code Optimization (TACO)
The migration prefetcher: Anticipating data promotion in dynamic NUCA caches

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Phase-based tuning for better utilization of performance-asymmetric multicore processors

CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Understanding fundamental design choices in single-ISA heterogeneous multicore architectures

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
KnightShift: Scaling the Energy Proportionality Wall through Server-Level Heterogeneity

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Replacement techniques for dynamic NUCA cache designs on CMPs

The Journal of Supercomputing
Low-latency adaptive mode transitions and hierarchical power management in asymmetric clustered cores

ACM Transactions on Architecture and Code Optimization (TACO)
A unified view of non-monotonic core selection and application steering in heterogeneous chip multiprocessors

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques

Quantified Score

Hi-index	0.02

Visualization

Abstract

This paper describes the tradeoff between latency performance and throughput performance in a power-constrained environment. We show that the key to achieving both excellent latency performance as well as excellent throughput performance is to dynamically vary the amount of energy expended to process instructions according to the amount of parallelism available in the software. We survey four techniques for achieving variable energy per instruction: voltage/frequency scaling, asymmetric cores, variable-size cores, and speculation control. We estimate the potential range of energies obtainable by each technique and conclude that a combination of asymmetric cores and voltage/frequency scaling offers the most promising approach to designing a chip-level multiprocessor that can achieve both excellent latency performance and excellent throughput performance.