Design space exploration of on-chip ring interconnection for a CPU-GPU heterogeneous architecture

Authors:
Jaekyu Lee;Si Li;Hyesoon Kim;Sudhakar Yalamanchili
Affiliations:
-;-;-;-
Venue:
Journal of Parallel and Distributed Computing
Year:
2013

Citing 33
Cited 0

Virtual-channel flow control

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
The adaptive bubble router

Journal of Parallel and Distributed Computing
A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning

HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Principles and Practices of Interconnection Networks

Principles and Practices of Interconnection Networks
DyAD: smart routing for networks-on-chip

Proceedings of the 41st annual Design Automation Conference
Exploiting the Routing Flexibility for Energy/Performance Aware Mapping of Regular NoC Architectures

DATE '03 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Pinpointing Representative Portions of Large Intel® Itanium® Programs with Dynamic Instrumentation

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Flattened butterfly: a cost-efficient topology for high-radix networks

Proceedings of the 34th annual international symposium on Computer architecture
On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus

NOCS '07 Proceedings of the First International Symposium on Networks-on-Chip
A Hybrid Ring/Mesh Interconnect for Network-on-Chip Using Hierarchical Rings for Global Routing

NOCS '07 Proceedings of the First International Symposium on Networks-on-Chip
NoC Topologies Exploration based on Mapping and Simulation Models

DSD '07 Proceedings of the 10th Euromicro Conference on Digital System Design Architectures, Methods and Tools
Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches

Proceedings of the 36th annual international symposium on Computer architecture
A case for bufferless routing in on-chip networks

Proceedings of the 36th annual international symposium on Computer architecture
Achieving predictable performance through better memory controller placement in many-core CMPs

Proceedings of the 36th annual international symposium on Computer architecture
Application-aware prioritization mechanisms for on-chip networks

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Rodinia: A benchmark suite for heterogeneous computing

IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
Throughput-oriented NoC topology generation and analysis for high performance SoCs

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
COMPASS: a programmable data prefetcher using idle GPU shaders

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
An analysis of on-chip interconnection networks for large-scale chip multiprocessors

ACM Transactions on Architecture and Code Optimization (TACO)
High performance cache replacement using re-reference interval prediction (RRIP)

Proceedings of the 37th annual international symposium on Computer architecture
Aérgia: exploiting packet latency slack in on-chip networks

Proceedings of the 37th annual international symposium on Computer architecture
aEqualized: a novel routing algorithm for the Spidergon network on chip

Proceedings of the Conference on Design, Automation and Test in Europe
ERCBench: An Open-Source Benchmark Suite for Embedded and Reconfigurable Computing

FPL '10 Proceedings of the 2010 International Conference on Field Programmable Logic and Applications
Throughput-Effective On-Chip Networks for Manycore Accelerators

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
A case for heterogeneous on-chip interconnects for CMPs

Proceedings of the 38th annual international symposium on Computer architecture
Kilo-NOC: a heterogeneous network-on-chip architecture for scalability and service guarantees

Proceedings of the 38th annual international symposium on Computer architecture
DBAR: an efficient routing algorithm to support multiple concurrent applications in networks-on-chip

Proceedings of the 38th annual international symposium on Computer architecture
TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture

HPCA '12 Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture
CPU-assisted GPGPU on fused CPU-GPU architectures

HPCA '12 Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture
A QoS-aware memory controller for dynamically balancing GPU and CPU bandwidth use in an MPSoC

Proceedings of the 49th Annual Design Automation Conference
Staged memory scheduling: achieving high performance and scalability in heterogeneous systems

Proceedings of the 39th Annual International Symposium on Computer Architecture

Quantified Score

Hi-index	0.00

Visualization

Abstract

Incorporating a GPU architecture into CMP, which is more efficient with certain types of applications, is a popular architecture trend in recent processors. This heterogeneous mix of architectures will use an on-chip interconnection to access shared resources such as last-level cache tiles and memory controllers. The configuration of this on-chip network will likely have a significant impact on resource distribution, fairness, and overall performance. The heterogeneity of this architecture inevitably exerts different pressures on the interconnection due to the differing characteristics and requirements of applications running on CPU and GPU cores. CPU applications are sensitive to latency, while GPGPU applications require massive bandwidth. This is due to the difference in the thread-level parallelism of the two architectures. GPUs use more threads to hide the effect of memory latency but require massive bandwidth to supply those threads. On the other hand, CPU cores typically running only one or two threads concurrently are very sensitive to latency. This study surveys the impact and behavior of the interconnection network when CPU and GPGPU applications run simultaneously. Among our findings, we observed that significant interference exists between CPU and GPU applications and resource partitioning, in particular virtual and physical channel partitioning, shows effectiveness to solve the interference problem. Also, heterogeneous link configurations show promising results by optimizing traffic hotspots in the network. Finally, we evaluated different placement policies and found that how to place different components in the network significantly affects the performance. Based on these findings, we suggest an optimal ring interconnect network. Our study will shed light on other architectural interconnection studies on CPU-GPU heterogeneous architectures.