Architecture and compiler optimizations for data bandwidth improvement in configurable processors

Authors:
Jason Cong;Guoling Han;Zhiru Zhang
Affiliations:
Computer Science Department, University of California, Los Angeles, CA;Computer Science Department, University of California, Los Angeles, CA;Computer Science Department, University of California, Los Angeles, CA
Venue:
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Year:
2006

Abstract

Many commercially available embedded processors are capable of extending their base instruction set for a specific domain of applications. While steady progress has been made in the tools and methodologies of automatic instruction set extension for configurable processors, the limited data bandwidth available in the core processor (e.g., the number of simultaneous accesses to the register file) becomes a potential performance bottleneck. In this paper, we first present a quantitative analysis of the data bandwidth limitation in configurable processors, and then propose a novel low-cost architectural extension and associated compilation techniques to address the problem. Specifically, we embed a single control bit in the instruction op-codes to selectively copy the execution results to a set of hash-mapped shadow registers in the write-back stage. This can efficiently reduce the communication overhead due to data transfers between the core processor and the custom logic. We also present a novel simultaneous global shadow register binding with a hash function generation algorithm to take full advantage of the extension. The application of our approach leads to a nearly optimal performance speedup.
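
The write-back behavior described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the register counts, the `Instruction` and `Machine` types, and the modulo hash in `shadow_index` are all assumptions made for illustration. The abstract says the hash function is produced by a generation algorithm, and it does not state what value the hash is applied to; the sketch assumes the destination register number.

```python
from dataclasses import dataclass

NUM_CORE_REGS = 32    # assumed size of the core register file
NUM_SHADOW_REGS = 4   # assumed number of shadow registers


def shadow_index(reg: int) -> int:
    """Hypothetical hash: map a core register number to a shadow register slot.

    A plain modulo is used here as a stand-in; the paper generates its
    hash function algorithmically.
    """
    return reg % NUM_SHADOW_REGS


@dataclass
class Instruction:
    dest: int        # destination core register
    value: int       # execution result (precomputed to keep the sketch short)
    copy_bit: bool   # the single control bit carried in the op-code


class Machine:
    def __init__(self) -> None:
        self.core = [0] * NUM_CORE_REGS
        self.shadow = [0] * NUM_SHADOW_REGS

    def write_back(self, ins: Instruction) -> None:
        # The normal write-back to the core register file always happens.
        self.core[ins.dest] = ins.value
        # If the control bit is set, the result is also copied to the
        # shadow register selected by hashing the destination register.
        if ins.copy_bit:
            self.shadow[shadow_index(ins.dest)] = ins.value


m = Machine()
m.write_back(Instruction(dest=5, value=42, copy_bit=True))   # copied to shadow
m.write_back(Instruction(dest=6, value=7, copy_bit=False))   # core only
```

In this sketch, custom logic would read its operands from `m.shadow` rather than issuing extra reads to the core register file, which is the communication overhead the abstract refers to. Two core registers that hash to the same slot would overwrite each other, which suggests why the register binding and the hash function are determined together.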