A Distributed Control Path Architecture for VLIW Processors

Authors:
Hongtao Zhong;Kevin Fan;Scott Mahlke;Michael Schlansker
Affiliations:
Advanced Computer Architecture Laboratory, University of Michigan - Ann Arbor, MI;Advanced Computer Architecture Laboratory, University of Michigan - Ann Arbor, MI;Advanced Computer Architecture Laboratory, University of Michigan - Ann Arbor, MI;Hewlett Packard Laboratories
Venue:
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Year:
2005

Abstract

VLIW architectures are popular in embedded systems because they offer high-performance processing at low cost and energy. The major problem with traditional VLIW designs is that they do not scale efficiently, due to bottlenecks that result from centralized resources and global communication. Multicluster designs have been proposed to solve the scaling problem of VLIW datapaths, but much less work has been done on the control path. In this paper, we propose a distributed control path architecture for VLIW processors (DVLIW) to overcome the scalability problem of VLIW control paths. The architecture simplifies the dispersal of complex VLIW instructions and supports efficient distribution of instructions through a limited-bandwidth interconnect, while also supporting compressed instruction encodings. DVLIW employs a multicluster design in which each cluster contains a local instruction memory that provides all intra-cluster control. Every cluster has its own program counter and instruction sequencing capabilities, so instruction execution is completely decentralized. The architecture executes multiple instruction streams at the same time, but these streams collectively function as a single logical instruction stream. Simulation results show that DVLIW processors reduce the number of cross-chip control signals by approximately two orders of magnitude while incurring a small performance overhead to explicitly manage the instruction streams.
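The idea of several per-cluster instruction streams behaving as one logical stream can be illustrated with a toy model. The sketch below is not the authors' simulator: the `Cluster` class, the `run` and `make_program` helpers, and the tuple-based operation format are all hypothetical. It also assumes, beyond what the abstract states, that every cluster holds its own copy of each branch and that the branch condition is visible to all clusters as a shared predicate.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy cluster: private instruction memory and private program counter."""
    imem: list                      # local instruction memory, one op per cycle
    pc: int = 0                     # this cluster's own program counter
    trace: list = field(default_factory=list)

    def step(self, preds):
        """Execute the op at the local PC; return False once halted."""
        op = self.imem[self.pc]
        kind = op[0]
        if kind == "halt":
            return False
        if kind == "br":
            # ("br", predicate_name, local_target): each cluster redirects
            # its own PC (assumption: the predicate is shared by all clusters).
            _, pred, target = op
            self.pc = target if preds[pred] else self.pc + 1
        else:
            # ("op", label): record the work done and fall through.
            self.trace.append(op[1])
            self.pc += 1
        return True


def run(clusters, preds, max_cycles=100):
    """Step all clusters in lockstep; return the cycle at which all halted."""
    for cycle in range(max_cycles):
        alive = [c.step(preds) for c in clusters]
        if not any(alive):
            return cycle
    raise RuntimeError("clusters did not halt")


def make_program(suffix):
    """Hypothetical per-cluster slice of one logical program."""
    return [
        ("op", "A" + suffix),   # 0: block A
        ("br", "p", 3),         # 1: skip block B when predicate p is true
        ("op", "B" + suffix),   # 2: block B
        ("op", "C" + suffix),   # 3: block C
        ("halt",),              # 4
    ]


c0, c1 = Cluster(make_program("0")), Cluster(make_program("1"))
cycles = run([c0, c1], {"p": True})
print(cycles, c0.trace, c1.trace)
```

No central sequencer appears in this model: each cluster fetches from its own memory and updates its own PC, yet both follow the same path through blocks A and C because they resolve the same predicate in the same cycle.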