Area-efficiency in CMP core design: co-optimization of microarchitecture and physical design

Authors:
Omid Azizi;Aqeel Mahesri;Sanjay J. Patel;Mark Horowitz
Affiliations:
Stanford University;University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign;Stanford University
Venue:
ACM SIGARCH Computer Architecture News
Year:
2009

Abstract

In this paper, we examine the area-performance design space of a processing core for a chip multiprocessor (CMP), considering both the architectural design space and the tradeoffs of the physical design on which the architecture relies. We first propose a methodology for performing an integrated optimization of both the microarchitecture and the physical circuit design of a microprocessor. In our approach, we use statistical and convex fitting methods to capture a large microarchitectural design space. We then characterize the area-delay tradeoffs of the underlying circuits through RTL synthesis. Finally, we establish the relationship between the architecture and the circuits in an integrative model, which we use to optimize the processor. As a case study, we apply this methodology to explore the performance-area tradeoffs in a highly parallel accelerator architecture for visual computing applications. Based on early circuit tradeoff data, our results indicate that two separate designs are performance/area optimal for our set of benchmarks: a simpler single-issue, 2-way multithreaded core running at a high frequency, and a more aggressively tuned dual-issue, 4-way multithreaded design running at a lower frequency.
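
The flow outlined in the abstract (fit a model to circuit tradeoff data, then optimize architecture and circuits together) can be illustrated with a toy sketch. This is not the authors' model or tooling: the monomial form `area = c * delay**(-a)`, the synthesis points, the per-core IPC and area figures, and all function names are hypothetical, chosen only to show the shape of such a co-optimization. A monomial is used because it becomes a straight line in log-log space, which makes the fit a simple convex least-squares problem. IPC is held constant across cycle times, which ignores pipeline-depth effects.

```python
import math

def fit_monomial(delays, areas):
    """Least-squares fit of area = c * delay**(-a), done in log-log space."""
    xs = [math.log(d) for d in delays]
    ys = [math.log(v) for v in areas]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return math.exp(intercept), -slope  # (c, a)

# Hypothetical synthesis results: (cycle time in FO4, logic area in mm^2).
SYNTH_POINTS = [(15.0, 6.0), (20.0, 3.0), (30.0, 1.3), (40.0, 0.75)]

# Hypothetical core configurations. "fixed_area" stands for structures
# assumed not to scale with cycle time; "logic_scale" multiplies the
# fitted logic-area curve.
CORES = {
    "1-issue, 2 threads": {"ipc": 0.8, "fixed_area": 1.0, "logic_scale": 1.0},
    "2-issue, 4 threads": {"ipc": 1.3, "fixed_area": 1.6, "logic_scale": 1.8},
}

def perf_per_area(core, cycle_time, c, a):
    """Throughput per unit area for one core at a given cycle time."""
    area = core["fixed_area"] + core["logic_scale"] * c * cycle_time ** (-a)
    perf = core["ipc"] / cycle_time  # instructions per FO4 delay
    return perf / area

def best_cycle_time(core, c, a, candidates):
    """Grid search over candidate cycle times for the best perf/area."""
    return max(candidates, key=lambda t: perf_per_area(core, t, c, a))

c, a = fit_monomial(*zip(*SYNTH_POINTS))
candidates = [15.0 + 0.5 * i for i in range(51)]  # 15 to 40 FO4
for name, core in CORES.items():
    t = best_cycle_time(core, c, a, candidates)
    print(f"{name}: cycle time {t:.1f} FO4, "
          f"perf/area {perf_per_area(core, t, c, a):.5f}")
```

A real study would replace the grid search with a convex solver, fit performance as a function of many microarchitectural parameters, and sweep an area budget to trace the full tradeoff curve rather than a single point per core.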