Clustered VLIW architectures have been widely adopted in modern embedded multimedia applications for their ability to exploit high degrees of ILP at a reasonable cost in complexity and silicon area. Studies have shown, however, that performance scales poorly on wide-issue machines. In this paper we describe the architecture of a clustered VLIW with a run-time reconfigurable inter-cluster bus designed to address this scalability problem. The architecture targets the acceleration of kernel loops through a coprocessor approach and allows the interconnect between neighboring register files to be customized before each loop execution. We adopt an inter-cluster communication mechanism based on a constant-complexity interconnect: because its complexity and latency are independent of the number of clusters, scalability in issue width is preserved. To handle the limited connectivity, the interconnection resources of the inter-cluster bus are exposed to the compiler and scheduled like other resources by an adapted version of modulo scheduling. Other relevant features include the ability to define shifting queues in the register files, for more effective software pipelining support. Adding a limited amount of reconfigurability to the well-established VLIW programming model yields low-overhead inter-cluster communication and a scalable ILP architecture. Simulation results show that near-linear scalability can be achieved for certain classes of kernel loops.
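To illustrate what it can mean to schedule interconnect resources "like other resources" under modulo scheduling, the following is a minimal sketch of a modulo reservation table in which an inter-cluster link is reserved in the same way as a functional unit. It is not the scheduler described in the paper: the class and function names, the resource names (`alu0`, `link01`), and the one-unit-per-cycle capacities are all assumptions chosen for illustration.

```python
from collections import defaultdict


class ModuloReservationTable:
    """Tracks resource usage per slot, where slot = cycle mod II (hypothetical helper)."""

    def __init__(self, ii, capacity):
        self.ii = ii                  # initiation interval of the software pipeline
        self.capacity = capacity      # resource name -> units available per cycle
        self.used = defaultdict(int)  # (slot, resource) -> units already reserved

    def fits(self, cycle, resources):
        # Assumes each resource name appears at most once in `resources`.
        slot = cycle % self.ii
        return all(self.used[(slot, r)] < self.capacity[r] for r in resources)

    def reserve(self, cycle, resources):
        slot = cycle % self.ii
        for r in resources:
            self.used[(slot, r)] += 1


def place(mrt, earliest, resources):
    """Reserve the first cycle in [earliest, earliest + II) with free resources.

    Returns the chosen cycle, or None if every slot is occupied, in which case
    a real scheduler would typically retry with a larger II.
    """
    for cycle in range(earliest, earliest + mrt.ii):
        if mrt.fits(cycle, resources):
            mrt.reserve(cycle, resources)
            return cycle
    return None


if __name__ == "__main__":
    # Assumed machine: one ALU per cluster and a single link between
    # the register files of clusters 0 and 1.
    mrt = ModuloReservationTable(ii=2, capacity={"alu0": 1, "alu1": 1, "link01": 1})
    print(place(mrt, 0, ["alu0"]))    # an operation on cluster 0
    print(place(mrt, 1, ["link01"]))  # a copy from cluster 0 to cluster 1
    print(place(mrt, 1, ["link01"]))  # a second copy competes for the same link
    print(place(mrt, 1, ["link01"]))  # no link slot left at this II
```

In this sketch an inter-cluster copy is an ordinary schedulable operation whose only resource is the link, so limited connectivity appears to the scheduler as a resource conflict rather than as a special case.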