IEEE Transactions on Computers
Limits of instruction-level parallelism
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A high-performance microarchitecture with hardware-programmable functional units
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Supporting dynamic data structures on distributed-memory machines
ACM Transactions on Programming Languages and Systems (TOPLAS)
The performance potential of data dependence speculation & collapsing
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Improving trace cache effectiveness with branch promotion and trace packing
Proceedings of the 25th annual international symposium on Computer architecture
MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
A C compiler for a processor with a reconfigurable functional unit
FPGA '00 Proceedings of the 2000 ACM/SIGDA eighth international symposium on Field programmable gate arrays
High-performance carry chains for FPGA's
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
CHIMAERA: a high-performance architecture with a tightly-coupled reconfigurable functional unit
Proceedings of the 27th annual international symposium on Computer architecture
PipeRench implementation of the instruction path coprocessor
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Increasing the size of atomic instruction blocks using control flow assertions
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
rePLay: A Hardware Framework for Dynamic Optimization
IEEE Transactions on Computers
Reconfigurable computing: a survey of systems and software
ACM Computing Surveys (CSUR)
Performance characterization of a hardware mechanism for dynamic optimization
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
A design space evaluation of grid processor architectures
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Efficient architecture/compiler co-exploration for ASIPs
CASES '02 Proceedings of the 2002 international conference on Compilers, architecture, and synthesis for embedded systems
High-Performance 3-1 Interlock Collapsing ALU's
IEEE Transactions on Computers
Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture
Proceedings of the 30th annual international symposium on Computer architecture
Processor Acceleration Through Automated Instruction Set Customization
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
MiBench: A free, commercially representative embedded benchmark suite
WWC '01 Proceedings of the Workload Characterization, 2001. WWC-4. 2001 IEEE International Workshop
Dynamic Strands: Collapsing Speculative Dependence Chains for Reducing Pipeline Communication
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Dataflow Mini-Graphs: Amplifying Superscalar Capacity and Bandwidth
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Static strands: safely collapsing dependence chains for increasing embedded power efficiency
LCTES '05 Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors
Proceedings of the 32nd annual international symposium on Computer Architecture
Exploring the design space of LUT-based transparent accelerators
Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Serialization-Aware Mini-Graphs: Performance with Fewer Resources
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Exploiting Narrow Accelerators with Data-Centric Subgraph Mapping
Proceedings of the International Symposium on Code Generation and Optimization
Static strands: Safely exposing dependence chains for increasing embedded power efficiency
ACM Transactions on Embedded Computing Systems (TECS) - Special Section LCTES'05
An Application Development Framework for ARISE Reconfigurable Processors
ACM Transactions on Reconfigurable Technology and Systems (TRETS)
CASES '10 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
The ARISE approach for extending embedded processors with arbitrary hardware accelerators
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
SoftHV: a HW/SW co-designed processor with horizontal and vertical fusion
Proceedings of the 8th ACM International Conference on Computing Frontiers
Functional unit chaining: a runtime adaptive architecture for reducing bypass delays
ACSAC'06 Proceedings of the 11th Asia-Pacific conference on Advances in Computer Systems Architecture
Speculative hardware/software co-designed floating-point multiply-add fusion
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
A just-in-time customizable processor
Proceedings of the International Conference on Computer-Aided Design
Hi-index | 0.00 |
In this article, we present an approach for improving the performance of sequences of dependent instructions. We observe that many sequences of instructionscan be interpreted as functions. Unlike sequences of instructions, functions can be translated into very fast butexponentially costly two-level combinational circuits. Wepresent an approach that exploits this principle, speeds upprograms thanks to circuit-level parallelism/redundancy,but avoids the exponential costs.We analyze the potential of this approach, and thenwe propose an implementation that consists of a superscalar processor with a large specific functional unit associated with specific back-end transformations. The performance of the SpecInt2000 benchmarks and selectedprograms from the Olden and MiBench benchmark suitesimproves on average from 2.4% to 12% depending on thelatency of the functional units, and up to 39.6%; moreprecisely, the performance of optimized code sections improves on average from 3.5% to 19%, and up to 49%.