McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures

Authors:
Sheng Li;Jung Ho Ahn;Richard D. Strong;Jay B. Brockman;Dean M. Tullsen;Norman P. Jouppi
Affiliations:
University of Notre Dame and Hewlett-Packard Labs;Seoul National University and Hewlett-Packard Labs;University of California, San Diego;University of Notre Dame;University of California, San Diego;Hewlett-Packard Labs
Venue:
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2009

Abstract

This paper introduces McPAT, an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multicore and manycore processor configurations ranging from 90nm to 22nm and beyond. At the microarchitectural level, McPAT includes models for the fundamental components of a chip multiprocessor, including in-order and out-of-order processor cores, networks-on-chip, shared caches, integrated memory controllers, and multiple-domain clocking. At the circuit and technology levels, McPAT supports critical-path timing modeling, area modeling, and dynamic, short-circuit, and leakage power modeling for each of the device types forecast in the ITRS roadmap, including bulk CMOS, SOI, and double-gate transistors. McPAT has a flexible XML interface to facilitate its use with many performance simulators. Combined with a performance simulator, McPAT enables architects to consistently quantify the cost of new ideas and assess tradeoffs of different architectures using new metrics such as the energy-delay-area² product (EDA²P) and the energy-delay-area product (EDAP). This paper explores the interconnect options of future manycore processors by varying the degree of clustering over generations of process technologies. Clustering brings interesting tradeoffs between area and performance: the interconnects needed to group cores into clusters incur area overhead, but many applications can make good use of them due to the synergies of cache sharing. Combining the power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node, for both common in-order and out-of-order manycore designs, shows that, when die cost is not taken into account, clustering 8 cores together gives the best energy-delay product, whereas, when cost is taken into account, configuring clusters with 4 cores gives the best EDA²P and EDAP.
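The metrics named in the abstract can be illustrated with a minimal sketch. It assumes the products are formed exactly as their names suggest (energy × delay, then multiplied by area or by area squared) and that lower is better. The `DesignPoint` class, the helper functions, and all numbers are hypothetical and chosen only for illustration; they are not McPAT output or results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignPoint:
    """One candidate configuration (hypothetical container, not a McPAT type)."""
    name: str
    energy_j: float   # total energy for the workload, in joules
    delay_s: float    # execution time, in seconds
    area_mm2: float   # die area, in square millimetres


def edp(p: DesignPoint) -> float:
    """Energy-delay product: ignores area, i.e. die cost."""
    return p.energy_j * p.delay_s


def edap(p: DesignPoint) -> float:
    """Energy-delay-area product."""
    return edp(p) * p.area_mm2


def eda2p(p: DesignPoint) -> float:
    """Energy-delay-area^2 product: penalizes area more heavily."""
    return edp(p) * p.area_mm2 ** 2


def best(points, metric):
    """Return the design point that minimizes the given metric."""
    return min(points, key=metric)


# Made-up numbers: the larger cluster is slightly faster but costs more area.
points = [
    DesignPoint("4-core clusters", energy_j=10.0, delay_s=1.00, area_mm2=100.0),
    DesignPoint("8-core clusters", energy_j=10.0, delay_s=0.95, area_mm2=110.0),
]

for metric in (edp, edap, eda2p):
    print(metric.__name__, "->", best(points, metric).name)
```

With these invented inputs the preferred design changes once area enters the metric, which is the kind of tradeoff the area-aware metrics are meant to expose.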