Group-caching for NoC based multicore cache coherent systems

Authors:
Wang Zuo; Shi Feng; Zuo Qi; Ji Weixing; Li Jiaxin; Deng Ning; Xue Licheng; Tan Yuan; Qiao Baojun
Affiliations:
Beijing Institute of Technology, Beijing, P.R. China (all authors)
Venue:
Proceedings of the Conference on Design, Automation and Test in Europe
Year:
2009

Abstract

Most chip multiprocessors (CMPs) use on-chip networks to connect their cores, and the trend is to integrate an increasing number of simple cores on a single die. Low-radix networks such as 2D-MESH are widely used in tiled CMPs because they map efficiently onto the chip. However, low-radix networks have a large diameter, which results in high network latency. In this paper, we propose group-caching, a cache design for NoC-based cache-coherent multicore systems. In our design, the on-chip L2 banks are organized into multiple groups. Each cache group behaves like a shared L2 cache for the cores inside it, while coherence between cache groups is maintained by coherence messages. In addition, group-caching adopts a new cache replacement policy to make more efficient use of the aggregate L2 cache capacity. Because most L2 accesses are served by the local cache group, the hop count is significantly reduced compared with a banked, shared L2 design. Full-system simulation results show that on a 2D-MESH, group-caching can improve performance by 2% to 8% over the banked, shared L2 design while reducing network energy consumption by 11% to 13%. The results also show that the communication overhead inside a cache group plays an important role in the performance of group-caching.
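The hop-count argument above can be illustrated with a small calculation. The sketch below is a minimal model, not the paper's simulator: it assumes a square k×k mesh with one core and one L2 bank per tile, minimal XY routing (so hops equal Manhattan distance), L2 accesses spread uniformly over the banks of the requesting core's group, and square groups of g×g tiles. Setting g equal to k models the banked, shared L2 in which addresses are interleaved across all banks. The function name `avg_hops` and the mesh and group sizes are hypothetical illustration choices, and the model ignores misses forwarded to other groups as well as coherence traffic.

```python
from fractions import Fraction
from itertools import product


def avg_hops(mesh_dim: int, group_dim: int) -> Fraction:
    """Average core-to-L2-bank hop count on a mesh_dim x mesh_dim 2D mesh.

    Assumptions (illustrative only): one core and one L2 bank per tile,
    minimal XY routing (hops = Manhattan distance), and every L2 access
    goes to a bank chosen uniformly at random inside the requesting core's
    square group of group_dim x group_dim tiles. group_dim == mesh_dim
    models a single shared L2 interleaved across all banks.
    """
    if group_dim <= 0 or mesh_dim % group_dim != 0:
        raise ValueError("group_dim must be a positive divisor of mesh_dim")
    tiles = list(product(range(mesh_dim), repeat=2))
    total = Fraction(0)
    for cx, cy in tiles:
        # Top-left tile of the (hypothetical, square) group containing this core.
        gx = (cx // group_dim) * group_dim
        gy = (cy // group_dim) * group_dim
        banks = [(gx + dx, gy + dy)
                 for dx in range(group_dim) for dy in range(group_dim)]
        # Mean Manhattan distance from this core to the banks of its group.
        hops = sum(abs(cx - bx) + abs(cy - by) for bx, by in banks)
        total += Fraction(hops, len(banks))
    # Average over all requesting cores.
    return total / len(tiles)


if __name__ == "__main__":
    for mesh, group in [(4, 4), (4, 2), (8, 8), (8, 4)]:
        print(f"{mesh}x{mesh} mesh, {group}x{group} groups: "
              f"{float(avg_hops(mesh, group))} hops")
```

Under these assumptions, a 4×4 mesh averages 2.5 hops for the fully shared L2 versus 1.0 hop with 2×2 groups, and an 8×8 mesh averages 5.25 versus 2.5 hops with 4×4 groups. Actual savings will be smaller, since remote-group misses and inter-group coherence messages add traffic that this model omits.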