Software pipelining: an effective scheduling technique for VLIW machines
PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
Iterative modulo scheduling: an algorithm for software pipelining loops
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Lx: a technology platform for customizable VLIW embedded processing
Proceedings of the 27th annual international symposium on Computer architecture
Microprocessor Architectures: From VLIW to TTA
Microprocessor Architectures: From VLIW to TTA
The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Using Dynamic Binary Translation to Fuse Dependent Instructions
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Dynamic Strands: Collapsing Speculative Dependence Chains for Reducing Pipeline Communication
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Dataflow Mini-Graphs: Amplifying Superscalar Capacity and Bandwidth
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Dynamic coalescing for 16-bit instructions
ACM Transactions on Embedded Computing Systems (TECS)
Frequent Loop Detection Using Efficient Nonintrusive On-Chip Hardware
IEEE Transactions on Computers
MiBench: A free, commercially representative embedded benchmark suite
WWC '01 Proceedings of the Workload Characterization, 2001. WWC-4. 2001 IEEE International Workshop
Reducing the cost of conditional transfers of control by using comparison specifications
Proceedings of the 2006 ACM SIGPLAN/SIGBED conference on Language, compilers, and tool support for embedded systems
Trace Scheduling: A Technique for Global Microcode Compaction
IEEE Transactions on Computers
Low-power data forwarding for VLIW embedded architectures
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Architecture Optimization of Application-Specific Implicit Instructions
ACM Transactions on Embedded Computing Systems (TECS) - Special Section on CAPA'09, Special Section on WHS'09, and Special Section VCPSS'09
In this paper, we propose the dynamic configuration of application-specific implicit instructions for pipelined processors to better exploit the available instruction-level parallelism. Given the target application, the compiler selects a set of candidate instructions to be executed implicitly, i.e., their execution is controlled through a data-driven model, which avoids explicit instruction fetch. Consequently, the clock cycles usually required for explicit issue are saved, improving performance and reducing code size. The compiler generates the reconfiguration operations needed to set up the data-path. The processor pipeline has been optimized to support the parallel execution of implicitly issued instructions with limited hardware overhead. The proposed technique has a negligible impact on the processor ISA (only reconfiguration instructions are added), which also shortens compiler development time, since the optimization can be added almost seamlessly to an existing compilation tool-chain. The proposed approach has been applied to DSP and multimedia kernel loops, and its performance has been compared with that of two baseline architectures: a scalar MIPS processor and a 4-issue VLIW processor of the LX family provided by STMicroelectronics [5]. Experimental results show a speedup ranging from 10% to 35% and an average code size reduction of 19%.
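The cycle and code-size savings described in the abstract can be illustrated with a rough back-of-the-envelope model. The sketch below is not the paper's evaluation method. It assumes a single-issue pipeline in which every explicitly issued instruction costs one cycle, implicitly issued instructions cost no issue slot, and the reconfiguration instructions run once before the loop. The function name `implicit_issue_estimate` and all of its parameters are hypothetical.

```python
def implicit_issue_estimate(loop_body, implicit, iterations, reconfig_ops):
    """Toy estimate of the effect of implicit issue on one kernel loop.

    Assumptions (illustrative only, not taken from the paper):
      - single-issue pipeline, one cycle per explicitly issued instruction
      - implicitly issued instructions consume no fetch/issue cycle
      - reconfiguration instructions execute once, before the loop

    loop_body    -- instructions in the original loop body
    implicit     -- how many of them are moved to implicit execution
    iterations   -- loop trip count
    reconfig_ops -- reconfiguration instructions added before the loop
    """
    if loop_body <= 0 or iterations <= 0 or reconfig_ops < 0:
        raise ValueError("loop_body and iterations must be positive, "
                         "reconfig_ops non-negative")
    if not 0 <= implicit < loop_body:
        raise ValueError("implicit must satisfy 0 <= implicit < loop_body")

    # Dynamic cost: cycles spent issuing instructions.
    baseline_cycles = loop_body * iterations
    new_cycles = (loop_body - implicit) * iterations + reconfig_ops

    # Static cost: instructions stored in program memory for this loop.
    baseline_size = loop_body
    new_size = loop_body - implicit + reconfig_ops

    return {
        "speedup": baseline_cycles / new_cycles,
        "code_size_reduction": 1 - new_size / baseline_size,
    }


if __name__ == "__main__":
    # Hypothetical kernel: 10-instruction body, 3 instructions made implicit.
    print(implicit_issue_estimate(loop_body=10, implicit=3,
                                  iterations=100, reconfig_ops=1))
```

The model ignores stalls, branch penalties, and multi-issue effects, so it only shows how the one-off reconfiguration cost is amortized over the loop's trip count. It is not meant to reproduce the reported figures.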