VLIW architectures are popular in embedded systems because they offer high-performance processing at low cost and low energy. The major problem with traditional VLIW designs is that they do not scale efficiently, owing to bottlenecks that result from centralized resources and global communication. Multicluster designs have been proposed to solve the scaling problem of VLIW datapaths, but much less work has been done on the control path. In this paper, we propose a distributed control path architecture for VLIW processors (DVLIW) to overcome the scalability problem of VLIW control paths. The architecture simplifies the dispersal of complex VLIW instructions and supports efficient distribution of instructions through a limited-bandwidth interconnect, while also supporting compressed instruction encodings. DVLIW employs a multicluster design in which each cluster contains a local instruction memory that provides all intra-cluster control. Every cluster has its own program counter and instruction sequencing capabilities, so instruction execution is completely decentralized. The architecture executes multiple instruction streams at the same time, but these streams collectively function as a single logical instruction stream. Simulation results show that DVLIW processors reduce the number of cross-chip control signals by approximately two orders of magnitude while incurring a small performance overhead to explicitly manage the instruction streams.
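As an illustration of the decentralized sequencing described above, the following is a minimal sketch in Python, not the DVLIW design itself. It assumes a toy model in which each cluster fetches from its own instruction memory using its own program counter, and the only value shared between clusters in a cycle is a one-bit branch outcome. The names `Cluster` and `run`, the opcode set, and the loop-counter branch condition are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy cluster: a local instruction memory plus its own program counter."""
    imem: list                                  # (opcode, arg) tuples, local to this cluster
    pc: int = 0                                 # each cluster sequences itself
    trace: list = field(default_factory=list)   # opcodes executed, for inspection

    def fetch(self):
        # Fetch uses only the local PC and the local instruction memory.
        return self.imem[self.pc]

    def advance(self, op, branch_taken):
        # Record the op, then compute the next PC locally.
        opcode, arg = op
        self.trace.append(opcode)
        if opcode == "br" and branch_taken:
            self.pc = arg                       # branch target is a local address
        else:
            self.pc += 1


def run(clusters, trip_count):
    """Step all clusters in lockstep until every cluster reaches 'halt'.

    Assumption for this sketch: a 'br' op closes a loop that should run
    trip_count times, and its outcome is the single shared control bit.
    Returns the number of cycles executed.
    """
    remaining = trip_count
    cycles = 0
    while True:
        ops = [c.fetch() for c in clusters]
        if all(op[0] == "halt" for op in ops):
            return cycles
        taken = False
        if any(op[0] == "br" for op in ops):
            remaining -= 1
            taken = remaining > 0               # one-bit outcome seen by all clusters
        for cluster, op in zip(clusters, ops):
            cluster.advance(op, taken)
        cycles += 1
```

For example, giving one cluster the program `[("add", None), ("br", 0), ("halt", None)]` and another `[("mul", None), ("br", 0), ("halt", None)]` and calling `run` with a trip count of 3 steps both clusters through the loop body three times. The two local streams stay aligned because every cluster sees the same branch outcome, which is the sense in which separate streams can behave as one logical stream in this toy model.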