Software pipelining: an effective scheduling technique for VLIW machines
PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
High-level synthesis: introduction to chip and system design
High-level synthesis: introduction to chip and system design
An iterative improvement algorithm for low power data path synthesis
ICCAD '95 Proceedings of the 1995 IEEE/ACM international conference on Computer-aided design
Computer architecture (2nd ed.): a quantitative approach
Computer architecture (2nd ed.): a quantitative approach
LISA—machine description language for cycle-accurate models of programmable DSP architectures
Proceedings of the 36th annual ACM/IEEE Design Automation Conference
A method to derive application-specific embedded processing cores
CODES '00 Proceedings of the eighth international workshop on Hardware/software codesign
Generating Reliable Embedded Processors
IEEE Micro
A cycle-accurate compilation algorithm for custom pipelined datapaths
CODES+ISSS '05 Proceedings of the 3rd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Applying Resource Sharing Algorithms to ADL-driven Automatic ASIP Implementation
ICCD '05 Proceedings of the 2005 International Conference on Computer Design
FPGA-friendly code compression for horizontal microcoded custom IPs
Proceedings of the 2007 ACM/SIGDA 15th international symposium on Field programmable gate arrays
C-based design flow: a case study on G.729A for voice over internet protocol (VoIP)
Proceedings of the 45th annual Design Automation Conference
Function Call Optimization for Efficient Behavioral Synthesis
IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences
Merged Dictionary Code Compression for FPGA Implementation of Custom Microcoded PEs
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
C-based design flow: a case study on G.729A for voice over internet protocol (VoIP)
Proceedings of the 45th annual Design Automation Conference
Logic synthesis and circuit customization using extensive external don't-cares
ACM Transactions on Design Automation of Electronic Systems (TODAES)
VariPipe: low-overhead variable-clock synchronous pipelines
ICCD'09 Proceedings of the 2009 IEEE international conference on Computer design
Proceedings of the 47th Design Automation Conference
Customizing IP cores for system-on-chip designs using extensive external don't-cares
Proceedings of the Conference on Design, Automation and Test in Europe
Word-Length Aware DSP Hardware Design Flow Based on High-Level Synthesis
Journal of Signal Processing Systems
High performance and area efficient flexible DSP datapath synthesis
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
System-Level Synthesis for Wireless Sensor Node Controllers: A Complete Design Flow
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Hi-index | 0.00 |
In this paper, we propose an approach for designing high-performance energy-efficient processing elements (PEs) using statically-scheduled nanocode-based architectures. Our approach is based on bottom-up refinement/trimming techniques that optimize a given datapath irrespective of whether it was designed manually or generated automatically. The optimizations can also preserve parts of the netlist specified by the designers, and hence, allow reuse of design efforts and can lead to predictable convergence. In this paper, we show that trimming unused and underutilized resources of typical general-purpose datapaths can lead to 30-40% average energy savings, without any performance loss. However, general-purpose architectures often compromise parallelism to make the design implementable. With our trimming approach, we can afford to have a base architecture that is not intended for implementation and has more parallelism, and then apply refinement to make it implementable. For our benchmarks, we achieved up to 1.8 times (avg. 25%) and 2.6 times (avg. 40%) performance improvement, compared to two general-purpose architectures (i.e. a 4-issue VLIW and a DLX), respectively. Additionally, the energy consumption is reduced by up to 5 times (avg. 2 times) compared to the trimmed general-purpose architectures.