LP-NUCA: networks-in-cache for high-performance low-power embedded processors

Authors:
Darío Suárez Gracia;Giorgos Dimitrakopoulos;Teresa Monreal Arnal;Manolis G. H. Katevenis;Víctor Viñals Yúfera
Affiliations:
Computer Architecture Group, Departamento de Informática e Ingeniería de Sistemas, Instituto de Investigación en Ingeniería de Aragón, Universidad de Zaragoza, Zaragoza, S ...;Informatics and Communications Engineering Department, University of West Macedonia, Kozani, Greece;Department of Computer Architecture, Universitat Politécnica de Catalunya, Catalunya, Spain and Computer Architecture Group, Universidad de Zaragoza, Zaragoza, Spain;Foundation for Research and Technology-Hellas, Institute of Computer Science, Computer Architecture and VLSI Systems Laboratory, Heraklion, Crete and Department of Computer Science, University of ...;Computer Architecture Group, Departamento de Informática e Ingeniería de Sistemas, Instituto de Investigación en Ingeniería de Aragón, Universidad de Zaragoza, Zaragoza, S ...
Venue:
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Year:
2012

Abstract

High-end embedded processors demand complex on-chip cache hierarchies that satisfy several conflicting design requirements, such as high-performance operation and low energy consumption. This paper introduces light-power (LP) nonuniform cache architecture (NUCA), a tiled cache that addresses both goals. LP-NUCA places a group of small, low-latency tiles between the L1 and the last-level cache (LLC); these tiles adapt better to application working sets and keep the most recently evicted blocks close to L1. LP-NUCA is built around three specialized "networks-in-cache," each aimed at a separate cache operation. To demonstrate the feasibility of the design, we have fully implemented LP-NUCA in a 90-nm technology. The VLSI implementation shows that the proposed networks-in-cache incur minimal area, latency, and power overhead. To further reduce energy consumption, LP-NUCA employs two network-wide techniques (miss wave stopping and sectoring) that together reduce the dynamic cache energy by 35% without degrading performance. Our evaluations also show that LP-NUCA improves performance with respect to cache hierarchies similar to those found in high-end embedded processors. Similar results have been obtained after scaling to a 32-nm technology.
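
The idea of keeping recently evicted blocks close to L1 can be illustrated with a toy model. The sketch below is not the LP-NUCA design: it ignores the three networks-in-cache, tile placement, timing, and energy, and it assumes fully associative LRU structures purely for simplicity. The class name `ToyHierarchy` and its parameters are hypothetical. It only shows how a small pool of storage that captures L1 victims can turn some LLC accesses into nearer hits.

```python
from collections import OrderedDict


class ToyHierarchy:
    """Toy model (an assumption, not the paper's design): an L1 plus a
    small pool of 'tiles' that holds the most recently evicted L1 blocks."""

    def __init__(self, l1_blocks, tile_blocks):
        self.l1 = OrderedDict()     # block address -> None, oldest first
        self.tiles = OrderedDict()  # recent L1 victims, oldest first
        self.l1_blocks = l1_blocks
        self.tile_blocks = tile_blocks
        self.hits = {"l1": 0, "tiles": 0, "llc": 0}

    def _fill_l1(self, addr):
        # Insert the block into L1; an LRU victim, if any, moves to the tiles.
        self.l1[addr] = None
        if len(self.l1) > self.l1_blocks:
            victim, _ = self.l1.popitem(last=False)
            self.tiles[victim] = None
            if len(self.tiles) > self.tile_blocks:
                # The oldest tile block is dropped (it would go to the LLC).
                self.tiles.popitem(last=False)

    def access(self, addr):
        """Return which level served the access: 'l1', 'tiles', or 'llc'."""
        if addr in self.l1:
            self.l1.move_to_end(addr)
            self.hits["l1"] += 1
            return "l1"
        if addr in self.tiles:
            # Hit in the tiles: the block is promoted back into L1.
            del self.tiles[addr]
            self.hits["tiles"] += 1
            self._fill_l1(addr)
            return "tiles"
        # Miss in both: assume the LLC (or memory) supplies the block.
        self.hits["llc"] += 1
        self._fill_l1(addr)
        return "llc"


if __name__ == "__main__":
    h = ToyHierarchy(l1_blocks=2, tile_blocks=2)
    for a in [0, 1, 2, 0]:
        print(a, h.access(a))
```

In this sketch, block 0 is evicted from the two-entry L1 when block 2 arrives, so the final access to block 0 is served by the tiles rather than the LLC.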