Analysis of static and dynamic energy consumption in NUCA caches: initial results

Authors:
Alessandro Bardine; Pierfrancesco Foglia; Giacomo Gabrielli; Cosimo Antonio Prete
Affiliations:
Università di Pisa, Pisa, Italy; Università di Pisa, Pisa, Italy; Università di Pisa, Pisa, Italy; Università di Pisa, Pisa, Italy
Venue:
MEDEA '07 Proceedings of the 2007 workshop on MEmory performance: DEaling with Applications, systems and architecture
Year:
2007

Abstract

NUCA caches are large on-chip L2 cache memories characterized by multi-bank partitioning and designed to hide wire-delay effects. They exhibit high hit rates while keeping access latency low. Two designs have been proposed for such caches: Static NUCA, in which data are statically allocated to the cache banks, and Dynamic NUCA, in which data may reside in different banks and a migration mechanism is introduced to better tolerate wire-delay effects. The two architectures reach different performance levels depending on architectural parameters and data management policies, at the cost of different balances between static and dynamic power consumption and energy dissipation. In this work, we present preliminary results on the characterization of these balances by evaluating the performance and energy consumption of conventional UCA (Uniform Cache Architecture) caches and of Static and Dynamic NUCA caches. All the considered cache architectures are of equal size and are assumed to be used in an aggressive high-frequency system running some applications from the SPEC CPU2000 and NAS Parallel Benchmarks suites. The experimental results indicate that, although data migration increases the dynamic energy consumption of Dynamic NUCA caches, the higher IPC achieved saves static energy, which dominates the power/energy balance in all the considered architectures. As a consequence, these results would designate NUCA caches as the best-performing and most energy-efficient of the evaluated architectures. Moreover, according to the obtained results, future power improvements for NUCA caches should concentrate on static energy, while, for dynamic energy, the on-chip network is the most critical element. Data migration is acceptable, since it has a positive impact on performance and the increase in dynamic energy is outweighed by the static energy savings resulting from the shorter execution time.
To give these statements general validity, we need to explore more design space points for each architecture (by varying the clock rate and other design parameters) and to evaluate them on a larger set of benchmarks.
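
The trade-off described in the abstract, more dynamic energy from migration against less static energy from a shorter run, can be illustrated with a minimal sketch. It assumes a simple first-order model (static energy = leakage power × execution time, execution time = instructions / (IPC × clock rate)), which is an assumption of this sketch and not necessarily the model used by the authors. The class name `CacheRun`, its fields, and every number below are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class CacheRun:
    """One cache configuration running one workload (all values hypothetical)."""
    name: str
    instructions: float       # instructions executed by the workload
    ipc: float                # instructions per cycle achieved
    clock_hz: float           # clock frequency in Hz (assumed value)
    static_power_w: float     # leakage power of the cache in watts
    dynamic_energy_j: float   # total dynamic energy (banks, network, migration) in joules

    def execution_time_s(self) -> float:
        # First-order model: cycles = instructions / IPC, time = cycles / frequency.
        return self.instructions / (self.ipc * self.clock_hz)

    def static_energy_j(self) -> float:
        # Leakage is paid for as long as the run lasts.
        return self.static_power_w * self.execution_time_s()

    def total_energy_j(self) -> float:
        return self.static_energy_j() + self.dynamic_energy_j


# Illustrative numbers only; they are NOT measurements from the paper.
# The dynamic design is given a higher IPC and a higher dynamic energy.
snuca = CacheRun("S-NUCA", 1e9, 0.50, 5e9, 2.0, 0.05)
dnuca = CacheRun("D-NUCA", 1e9, 0.625, 5e9, 2.0, 0.08)

for run in (snuca, dnuca):
    print(f"{run.name}: time={run.execution_time_s():.3f} s, "
          f"static={run.static_energy_j():.3f} J, "
          f"dynamic={run.dynamic_energy_j:.3f} J, "
          f"total={run.total_energy_j():.3f} J")
```

With these made-up inputs, the dynamic configuration spends more dynamic energy (0.08 J against 0.05 J) but finishes sooner, so its total comes to about 0.72 J against about 0.85 J. The outcome depends entirely on static energy being the larger term, which is the condition the abstract reports for the evaluated architectures.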