Application-specific accelerators provide orders-of-magnitude improvements in energy efficiency over CPUs, and accelerator-rich computing platforms are showing promise in the dark silicon age. Memory sharing among accelerators yields large transistor savings, but it requires new designs for the interconnects between accelerators and shared memories. Accelerators run 100x faster than CPUs and place a high demand on data. This leads to resource-consuming interconnects if we follow the same design rules as those for interconnects between CPUs and shared memories and simply duplicate the interconnect hardware to meet the accelerator data demand. In this work we develop a novel design for the interconnects between accelerators and shared memories and exploit three optimization opportunities that emerge in accelerator-rich computing platforms: 1) the multiple data ports of the same accelerator are powered on/off together, so the competition for shared resources among these ports can be eliminated to save interconnect transistor cost; 2) in dark silicon, the number of simultaneously active accelerators is usually limited, so the interconnect can be partially populated to fit just the data-access demand allowed by the power budget; 3) the heterogeneity of accelerators gives rise to distinct execution patterns, and by identifying these patterns through probability analysis, the interconnect can be optimized for its expected utilization. Experiments show that our interconnect design outperforms prior work that was optimized for CPU cores or signal routing.
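To make the three optimization opportunities concrete, the following is a minimal sketch, not the paper's actual algorithm, of how a partially populated accelerator-to-memory crossbar might be sized under a power budget. The accelerator names, port counts, activation probabilities, bank count, and the simple sizing rule are all illustrative assumptions.

```python
"""Illustrative sketch only: sizing a partially populated
accelerator-to-memory crossbar under a power budget.
All names, numbers, and the sizing rule are assumptions."""
from itertools import combinations

# Hypothetical accelerators: (name, number of data ports, activation probability)
accelerators = [
    ("fft",    4, 0.30),
    ("matmul", 8, 0.20),
    ("huff",   2, 0.50),
    ("sobel",  4, 0.25),
]
memory_banks = 16   # shared memory banks on the other side of the crossbar
max_active = 2      # power budget: at most 2 accelerators powered on at once

# Fully populated crossbar: every accelerator port can reach every bank.
full_switches = sum(ports for _, ports, _ in accelerators) * memory_banks

# Opportunity 2: only provision for the worst-case port demand of any
# accelerator subset that the power budget allows to run concurrently.
# Opportunity 1: the ports of one accelerator switch on/off together,
# so they are counted as a single co-scheduled group rather than as
# independently competing requesters.
worst_demand = max(
    sum(ports for _, ports, _ in combo)
    for combo in combinations(accelerators, max_active)
)
partial_switches = worst_demand * memory_banks

# Opportunity 3: expected concurrent port demand from execution-pattern
# probabilities (assuming independent activations for simplicity).
expected_demand = sum(ports * prob for _, ports, prob in accelerators)

print(f"fully populated switch points    : {full_switches}")
print(f"partially populated switch points: {partial_switches}")
print(f"expected concurrent port demand  : {expected_demand:.1f} ports")
```

Under these assumed numbers, the partially populated design needs far fewer switch points than the fully populated one, and the expected-demand estimate suggests where further trimming toward typical (rather than worst-case) utilization could pay off.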