Impact of intercluster communication mechanisms on ILP in clustered VLIW architectures

Authors:
Anup Gangwar;M. Balakrishnan;Anshul Kumar
Affiliations:
Freescale Semiconductor, NOIDA (UP), India;Indian Institute of Technology Delhi, New Delhi, India;Indian Institute of Technology Delhi, New Delhi, India
Venue:
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Year:
2007



Abstract

VLIW processors have started gaining acceptance in the embedded systems domain. However, VLIW processors with a monolithic register file and a large number of functional units (FUs) are not viable: supporting every FU requires a large number of register file ports, which makes the register file expensive and extremely slow. A simple solution is to break the register file into several smaller register files, each connected to a subset of the FUs. These architectures are termed clustered VLIW processors. In this article, we first build a case for clustered VLIW processors with four or more clusters by showing that the achievable ILP in most media applications for a 16 ALU and 8 LD/ST VLIW processor is around 20. We then provide a classification of the intercluster interconnection design space and show that a large part of this design space is currently unexplored. Next, using our performance evaluation methodology, we evaluate a subset of this design space and show that the most commonly used type of interconnection, RF-to-RF, falls short of the achievable performance by a large factor, while certain other types of interconnection narrow this gap considerably. We also establish that this behavior is heavily application dependent, which emphasizes the importance of application-specific architecture exploration. In addition, we present results on the statistical behavior of these architectures as the number of clusters in our framework varies from 4 to 16. These results clearly show the advantages of one specific architecture over the others. Finally, based on our results, we propose a new interconnection network, which should narrow this performance gap.
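The port-count argument in the abstract can be illustrated with a rough back-of-the-envelope sketch. It is not taken from the article: the function name `ports_per_register_file` is hypothetical, and it assumes that every FU (ALU or LD/ST) needs two read ports and one write port, and that ports added for intercluster transfers are ignored.

```python
# Illustrative sketch only: the per-FU port counts are assumptions,
# not figures from the article, and intercluster ports are ignored.

def ports_per_register_file(num_fus, num_clusters, reads_per_fu=2, writes_per_fu=1):
    """Return the number of ports each register file needs when the FUs
    are split evenly across num_clusters clusters."""
    if num_clusters <= 0 or num_fus % num_clusters != 0:
        raise ValueError("FUs must divide evenly across a positive number of clusters")
    fus_per_cluster = num_fus // num_clusters
    return fus_per_cluster * (reads_per_fu + writes_per_fu)

# The machine named in the abstract: 16 ALUs and 8 LD/ST units.
TOTAL_FUS = 16 + 8

for clusters in (1, 4, 8):
    ports = ports_per_register_file(TOTAL_FUS, clusters)
    print(f"{clusters} cluster(s): {ports} ports per register file")
```

Under these assumptions the monolithic case (one cluster) needs 72 ports on a single register file, while four clusters need 18 ports per register file. The cost that clustering introduces in exchange is the intercluster communication that the article studies.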