Application-specific accelerators provide orders-of-magnitude improvements in energy efficiency over CPUs, and accelerator-rich computing platforms are showing promise in the dark silicon age. Memory sharing among accelerators yields large transistor savings, but it requires new designs for the interconnects between accelerators and shared memories. Accelerators run 100x faster than CPUs and place a high demand on data. This leads to resource-consuming interconnects if we follow the same design rules as those for interconnects between CPUs and shared memories and simply duplicate the interconnect hardware to meet the accelerator data demand. In this work we develop a novel design for the interconnects between accelerators and shared memories and exploit three optimization opportunities that emerge in accelerator-rich computing platforms: 1) the multiple data ports of the same accelerator are powered on/off together, so the competition for shared resources among these ports can be eliminated to save interconnect transistor cost; 2) in dark silicon, the number of simultaneously active accelerators is usually limited, so the interconnect can be partially populated to fit just the data-access demand allowed by the power budget; 3) the heterogeneity of accelerators gives rise to distinct execution patterns, and by identifying these patterns through probability analysis, the interconnect can be optimized for its expected utilization. Experiments show that our interconnect design outperforms prior work that was optimized for CPU cores or signal routing.
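To make the three optimization opportunities concrete, the following is a minimal sketch, not the paper's actual algorithm, of how a partially populated accelerator-to-memory crossbar might be sized under a power budget. The accelerator names, port counts, activation probabilities, bank count, and the simple sizing rule are all illustrative assumptions.

```python
"""Illustrative sketch only: sizing a partially populated
accelerator-to-memory crossbar under a power budget.
All names, numbers, and the sizing rule are assumptions."""
from itertools import combinations

# Hypothetical accelerators: (name, number of data ports, activation probability)
accelerators = [
    ("fft",    4, 0.30),
    ("matmul", 8, 0.20),
    ("huff",   2, 0.50),
    ("sobel",  4, 0.25),
]
memory_banks = 16   # shared memory banks on the other side of the crossbar
max_active = 2      # power budget: at most 2 accelerators powered on at once

# Fully populated crossbar: every accelerator port can reach every bank.
full_switches = sum(ports for _, ports, _ in accelerators) * memory_banks

# Opportunity 2: only provision for the worst-case port demand of any
# accelerator subset that the power budget allows to run concurrently.
# Opportunity 1: the ports of one accelerator switch on/off together,
# so they are counted as a single co-scheduled group rather than as
# independently competing requesters.
worst_demand = max(
    sum(ports for _, ports, _ in combo)
    for combo in combinations(accelerators, max_active)
)
partial_switches = worst_demand * memory_banks

# Opportunity 3: expected concurrent port demand from execution-pattern
# probabilities (assuming independent activations for simplicity).
expected_demand = sum(ports * prob for _, ports, prob in accelerators)

print(f"fully populated switch points    : {full_switches}")
print(f"partially populated switch points: {partial_switches}")
print(f"expected concurrent port demand  : {expected_demand:.1f} ports")
```

Under these assumed numbers, the partially populated design needs far fewer switch points than the fully populated one, and the expected-demand estimate suggests where further trimming toward typical (rather than worst-case) utilization could pay off.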