A bridging model for parallel computation
Communications of the ACM
ROMM routing on mesh and torus networks
Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures
Proceedings of the 27th annual international symposium on Computer architecture
Universal schemes for parallel communication
STOC '81 Proceedings of the thirteenth annual ACM symposium on Theory of computing
Chap - a SIMD graphics processor
SIGGRAPH '84 Proceedings of the 11th annual conference on Computer graphics and interactive techniques
A Delay Model and Speculative Architecture for Pipelined Routers
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Principles and Practices of Interconnection Networks
Principles and Practices of Interconnection Networks
Low-Latency Virtual-Channel Routers for On-Chip Networks
Proceedings of the 31st annual international symposium on Computer architecture
Evaluating the Imagine Stream Architecture
Proceedings of the 31st annual international symposium on Computer architecture
Merrimac: Supercomputing with Streams
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Microarchitecture of a High-Radix Router
Proceedings of the 32nd annual international symposium on Computer Architecture
Near-Optimal Worst-Case Throughput Routing for Two-Dimensional Mesh Networks
Proceedings of the 32nd annual international symposium on Computer Architecture
Design tradeoffs for tiled CMP on-chip networks
Proceedings of the 20th annual international conference on Supercomputing
Express virtual channels: towards the ideal interconnection fabric
Proceedings of the 34th annual international symposium on Computer architecture
IEEE Micro
Flattened Butterfly Topology for On-Chip Networks
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Program optimization space pruning for a multithreaded gpu
Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Larrabee: a many-core x86 architecture for visual computing
ACM SIGGRAPH 2008 papers
Scalable Parallel Programming with CUDA
Queue - GPU Computing
Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Pangaea: a tightly-coupled IA32 heterogeneous chip multiprocessor
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
A case for bufferless routing in on-chip networks
Proceedings of the 36th annual international symposium on Computer architecture
Achieving predictable performance through better memory controller placement in many-core CMPs
Proceedings of the 36th annual international symposium on Computer architecture
A view of the parallel computing landscape
Communications of the ACM - A View of Parallel Computing
Exploring concentration and channel slicing in on-chip network router
NOCS '09 Proceedings of the 2009 3rd ACM/IEEE International Symposium on Networks-on-Chip
Complexity effective memory access scheduling for many-core accelerator architectures
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Low-cost router microarchitecture for on-chip networks
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Rodinia: A benchmark suite for heterogeneous computing
IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
Cohesion: a hybrid memory model for accelerators
Proceedings of the 37th annual international symposium on Computer architecture
Area-I/O flip-chip routing for chip-package co-design considering signal skews
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
ORION 2.0: a fast and accurate NoC power and area model for early-stage design space exploration
Proceedings of the Conference on Design, Automation and Test in Europe
Throughput-Effective On-Chip Networks for Manycore Accelerators
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Probabilistic Distance-Based Arbitration: Providing Equality of Service for Many-Core CMPs
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
A case for heterogeneous on-chip interconnects for CMPs
Proceedings of the 38th annual international symposium on Computer architecture
CCNoC: Specializing On-Chip Interconnects for Energy Efficiency in Cache-Coherent Servers
NOCS '12 Proceedings of the 2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip
NOCS '12 Proceedings of the 2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip
NOC-Out: Microarchitecting a Scale-Out Processor
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Hi-index | 0.00 |
As the number of cores and threads in throughput accelerators such as Graphics Processing Units (GPU) increases, so does the importance of on-chip interconnection network design. This article explores throughput-effective Network-on-Chips (NoC) for future compute accelerators that employ Bulk-Synchronous Parallel (BSP) programming models such as CUDA and OpenCL. A hardware optimization is “throughput effective” if it improves parallel application-level performance per unit chip area. We evaluate performance of future looking workloads using detailed closed-loop simulations modeling compute nodes, NoC, and the DRAM memory system. We start from a mesh design with bisection bandwidth balanced to off-chip demand. Accelerator workloads tend to demand high off-chip memory bandwidth which results in a many-to-few traffic pattern when coupled with expected technology constraints of slow growth in pins-per-chip. Leveraging these observations we reduce NoC area by proposing a “checkerboard” NoC which alternates between conventional full routers and half routers with limited connectivity. Next, we show that increasing network terminal bandwidth at the nodes connected to DRAM controllers alleviates a significant fraction of the remaining imbalance resulting from the many-to-few traffic pattern. Furthermore, we propose a “double checkerboard inverted” NoC organization which takes advantage of channel slicing to reduce area while maintaining the performance improvements of the aforementioned techniques. This organization also has a simpler routing mechanism and improves average application throughput per unit area by 24.3%.