Architecting on-chip interconnects for stacked 3D STT-RAM caches in CMPs

Authors:
Asit K. Mishra;Xiangyu Dong;Guangyu Sun;Yuan Xie;N. Vijaykrishnan;Chita R. Das
Affiliations:
The Pennsylvania State University, University Park, PA, USA;The Pennsylvania State University, University Park, PA, USA;The Pennsylvania State University, University Park, PA, USA;The Pennsylvania State University, University Park, PA, USA;The Pennsylvania State University, University Park, PA, USA;The Pennsylvania State University, University Park, PA, USA
Venue:
Proceedings of the 38th annual international symposium on Computer architecture
Year:
2011

Citing 17
Cited 8

Route packets, not wires: on-chip inteconnection networks

Proceedings of the 38th annual Design Automation Conference
Symbiotic jobscheduling for a simultaneous multithreaded processor

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Orion: a power-performance simulator for interconnection networks

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
A Delay Model and Speculative Architecture for Pipelined Routers

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Design space exploration for 3D architectures

ACM Journal on Emerging Technologies in Computing Systems (JETC)
PicoServer: using 3D stacking technology to enable a compact energy efficient chip multiprocessor

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Die Stacking (3D) Microarchitecture

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Circuit and microarchitecture evaluation of 3D stacking magnetic RAM (MRAM) as a universal memory replacement

Proceedings of the 45th annual Design Automation Conference
System-Level Performance Metrics for Multiprogram Workloads

IEEE Micro
Architecting phase change memory as a scalable dram alternative

Proceedings of the 36th annual international symposium on Computer architecture
Scalable high performance main memory system using phase-change memory technology

Proceedings of the 36th annual international symposium on Computer architecture
Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Energy reduction for STT-RAM using early write termination

Proceedings of the 2009 International Conference on Computer-Aided Design
Resistive computation: avoiding the power wall with low-leakage, STT-MRAM based computing

Proceedings of the 37th annual international symposium on Computer architecture
Energy- and endurance-aware design of phase change memory caches

Proceedings of the Conference on Design, Automation and Test in Europe
Modeling, Architecture, and Applications for Emerging Memory Technologies

IEEE Design & Test

HaVOC: a hybrid memory-aware virtualization layer for on-chip distributed ScratchPad and non-volatile memories

Proceedings of the 49th Annual Design Automation Conference
Design space exploration of workload-specific last-level caches

Proceedings of the 2012 ACM/IEEE international symposium on Low power electronics and design
A practical method for estimating performance degradation on multicore processors, and its application to HPC workloads

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
On-chip caches built on multilevel spin-transfer torque RAM cells and its optimizations

ACM Journal on Emerging Technologies in Computing Systems (JETC) - Special issue on memory technologies
SPaC: a segment-based parallel compression for backup acceleration in nonvolatile processors

Proceedings of the Conference on Design, Automation and Test in Europe
A network congestion-aware memory subsystem for manycore

ACM Transactions on Embedded Computing Systems (TECS) - Special Section on Wireless Health Systems, On-Chip and Off-Chip Network Architectures
Optimal placement of vertical connections in 3D Network-on-Chip

Journal of Systems Architecture: the EUROMICRO Journal
DHeating: dispersed heating repair for self-healing NAND flash memory

Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis

Quantified Score

Hi-index	0.00

Visualization

Abstract

Emerging memory technologies such as STT-RAM, PCRAM, and resistive RAM are being explored as potential replacements to existing on-chip caches or main memories for future multi-core architectures. This is due to the many attractive features these memory technologies posses: high density, low leakage, and non-volatility. However, the latency and energy overhead associated with the write operations of these emerging memories has become a major obstacle in their adoption. Previous works have proposed various circuit and architectural level solutions to mitigate the write overhead. In this paper, we study the integration of STT-RAM in a 3D multi-core environment and propose solutions at the on-chip network level to circumvent the write overhead problem in the cache architecture with STT-RAM technology. Our scheme is based on the observation that instead of staggering requests to a write-busy STT-RAM bank, the network should schedule requests to other idle cache banks for effectively hiding the latency. Thus, we prioritize cache accesses to the idle banks by delaying accesses to the STT-RAM cache banks that are currently serving long latency write requests. Through a detailed characterization of the cache access patterns of 42 applications, we propose an efficient mechanism to facilitate such delayed writes to cache banks by (a) accurately estimating the busy time of each cache bank through logical partitioning of the cache layer and (b) prioritizing packets in a router requesting accesses to idle banks. Evaluations on a 3D architecture, consisting of 64 cores and 64 STT-RAM cache banks, show that our proposed approach provides 14% average IPC improvement for multi-threaded benchmarks, 19% instruction throughput benefits for multi-programmed workloads, and 6% latency reduction compared to a recently proposed write buffering mechanism.