Throughput-Effective On-Chip Networks for Manycore Accelerators

Authors:
Ali Bakhoda;John Kim;Tor M. Aamodt
Affiliations:
-;-;-
Venue:
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2010

Citing 31
Cited 9

A bridging model for parallel computation

Communications of the ACM
ROMM routing on mesh and torus networks

Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures
Universal schemes for parallel communication

STOC '81 Proceedings of the thirteenth annual ACM symposium on Theory of computing
Chap - a SIMD graphics processor

SIGGRAPH '84 Proceedings of the 11th annual conference on Computer graphics and interactive techniques
A Delay Model and Speculative Architecture for Pipelined Routers

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Principles and Practices of Interconnection Networks

Principles and Practices of Interconnection Networks
Low-Latency Virtual-Channel Routers for On-Chip Networks

Proceedings of the 31st annual international symposium on Computer architecture
Evaluating the Imagine Stream Architecture

Proceedings of the 31st annual international symposium on Computer architecture
Merrimac: Supercomputing with Streams

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Niagara: A 32-Way Multithreaded Sparc Processor

IEEE Micro
Microarchitecture of a High-Radix Router

Proceedings of the 32nd annual international symposium on Computer Architecture
Near-Optimal Worst-Case Throughput Routing for Two-Dimensional Mesh Networks

Proceedings of the 32nd annual international symposium on Computer Architecture
Cell Multiprocessor Communication Network: Built for Speed

IEEE Micro
Design tradeoffs for tiled CMP on-chip networks

Proceedings of the 20th annual international conference on Supercomputing
Express virtual channels: towards the ideal interconnection fabric

Proceedings of the 34th annual international symposium on Computer architecture
On-Chip Interconnection Architecture of the Tile Processor

IEEE Micro
Flattened Butterfly Topology for On-Chip Networks

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Program optimization space pruning for a multithreaded gpu

Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Scalable Parallel Programming with CUDA

Queue - GPU Computing
Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Token flow control

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
A case for bufferless routing in on-chip networks

Proceedings of the 36th annual international symposium on Computer architecture
Achieving predictable performance through better memory controller placement in many-core CMPs

Proceedings of the 36th annual international symposium on Computer architecture
A view of the parallel computing landscape

Communications of the ACM - A View of Parallel Computing
Complexity effective memory access scheduling for many-core accelerator architectures

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Low-cost router microarchitecture for on-chip networks

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Rodinia: A benchmark suite for heterogeneous computing

IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
A Task-Centric Memory Model for Scalable Accelerator Architectures

IEEE Micro
Cohesion: a hybrid memory model for accelerators

Proceedings of the 37th annual international symposium on Computer architecture
ORION 2.0: a fast and accurate NoC power and area model for early-stage design space exploration

Proceedings of the Conference on Design, Automation and Test in Europe

A case for heterogeneous on-chip interconnects for CMPs

Proceedings of the 38th annual international symposium on Computer architecture
Optimizing heterogeneous NoC design

Proceedings of the International Workshop on System Level Interconnect Prediction
OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
NOC-Out: Microarchitecting a Scale-Out Processor

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Adaptive virtual channel partitioning for network-on-chip in heterogeneous architectures

ACM Transactions on Design Automation of Electronic Systems (TODAES) - Special Section on Networks on Chip: Architecture, Tools, and Methodologies
Designing on-chip networks for throughput accelerators

ACM Transactions on Architecture and Code Optimization (TACO)
Neither more nor less: optimizing thread-level parallelism for GPGPUs

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Design space exploration of on-chip ring interconnection for a CPU-GPU heterogeneous architecture

Journal of Parallel and Distributed Computing
On-chip traffic regulation to reduce coherence protocol cost on a microthreaded many-core architecture with distributed caches

ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers

Quantified Score

Hi-index	0.00

Visualization

Abstract

As the number of cores and threads in many core compute accelerators such as Graphics Processing Units (GPU) increases, so does the importance of on-chip interconnection network design. This paper explores throughput-effective network-on-chips (NoC) for future many core accelerators that employ bulk-synchronous parallel (BSP) programming models such as CUDA and OpenCL. A hardware optimization is "throughput-effective" if it improves parallel application level performance per unit chip area. We evaluate performance of future looking workloads using detailed closed-loop simulations modeling compute nodes, NoC and the DRAM memory system. We start from a mesh design with bisection bandwidth balanced with off-chip demand. Accelerator workloads tend to demand high off-chip memory bandwidth which results in a many-to-few traffic pattern when coupled with expected technology constraints of slow growth in pins-per-chip. Leveraging these observations we reduce NoC area by proposing a "checkerboard", NoCwhich alternates between conventional full-routers and half-routers with limited connectivity. Checkerboard employs a new oblivious routing algorithm that maintains a minimum hop-count for architectures that place L2 cache banks at the half-router nodes. Next, we show that increasing network injection bandwidth for the large amount of read reply traffic at the nodes connected to DRAM controllers alleviates a significant fraction of the remaining imbalance resulting from the many-to-few traffic pattern. The combined effect of the above optimizations with an improved placement of memory controllers in the mesh and channel slicing improves application throughput per unit area by 25.4%.