The Effectiveness of Loop Unrolling for Modulo Scheduling in Clustered VLIW Architectures

Authors:
Jesús Sánchez;Antonio González
Affiliations:
-;-
Venue:
ICPP '00 Proceedings of the Proceedings of the 2000 International Conference on Parallel Processing
Year:
2000

Citing 18
Cited 22

Bulldog: a compiler for VLSI architectures

Bulldog: a compiler for VLSI architectures
Software pipelining: an effective scheduling technique for VLIW machines

PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
Partitioned register files for VLIWs: a preliminary analysis of tradeoffs

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
The multiscalar architecture

The multiscalar architecture
Multiscalar processors

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Unrolling-based optimizations for modulo scheduling

Proceedings of the 28th annual international symposium on Microarchitecture
Improving superscalar instruction dispatch and issue by exploiting dynamic code sequences

Proceedings of the 24th annual international symposium on Computer architecture
Complexity-effective superscalar processors

Proceedings of the 24th annual international symposium on Computer architecture
Trace processors

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
The multicluster architecture: reducing cycle time through partitioning

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Cache sensitive modulo scheduling

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Exploiting idle floating-point resources for integer execution

PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Effective cluster assignment for modulo scheduling

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Unified assign and schedule: a new approach to scheduling for clustered register file microarchitectures

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Clustered speculative multithreaded processors

ICS '99 Proceedings of the 13th international conference on Supercomputing
Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing

MICRO 14 Proceedings of the 14th annual workshop on Microprogramming
Distributed Modulo Scheduling

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Swing Modulo Scheduling: A Lifetime-Sensitive Approach

PACT '96 Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques

Modulo scheduling for a fully-distributed clustered VLIW architecture

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Lifetime-Sensitive Modulo Scheduling in a Production Environment

IEEE Transactions on Computers
Loop Transformations for Architectures with Partitioned Register Banks

OM '01 Proceedings of the 2001 ACM SIGPLAN workshop on Optimization of middleware and distributed systems
Tailoring pipeline bypassing and functional unit mapping to application in clustered VLIW architectures

CASES '01 Proceedings of the 2001 international conference on Compilers, architecture, and synthesis for embedded systems
Loop fusion for clustered VLIW architectures

Proceedings of the joint conference on Languages, compilers and tools for embedded systems: software and compilers for embedded systems
Affinity-based cluster assignment for unrolled loops

ICS '02 Proceedings of the 16th international conference on Supercomputing
An interleaved cache clustered VLIW processor

ICS '02 Proceedings of the 16th international conference on Supercomputing
Graph-partitioning based instruction scheduling for clustered processors

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Modulo scheduling with integrated register spilling for clustered VLIW architectures

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Optimizing Loop Performance for Clustered VLIW Architectures

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Exploiting Pseudo-Schedules to Guide Data Dependence Graph Partitioning

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Effective instruction scheduling techniques for an interleaved cache clustered VLIW processor

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Local scheduling techniques for memory coherence in a clustered VLIW processor with a distributed data cache

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Instruction Replication for Clustered Microarchitectures

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Flexible Compiler-Managed L0 Buffers for Clustered VLIW Processors

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Removing communications in clustered microarchitectures through instruction replication

ACM Transactions on Architecture and Code Optimization (TACO)
Distributed Data Cache Designs for Clustered VLIW Processors

IEEE Transactions on Computers
Software and hardware techniques to optimize register file utilization in VLIW architectures

International Journal of Parallel Programming
Virtual Cluster Scheduling Through the Scheduling Graph

Proceedings of the International Symposium on Code Generation and Optimization
Heterogeneous Clustered VLIW Microarchitectures

Proceedings of the International Symposium on Code Generation and Optimization
Automated dynamic throughput-constrained structural-level pipelining in streaming applications

Proceedings of the conference on Design, automation and test in Europe
Function inlining and loop unrolling for loop acceleration in reconfigurable processors

Proceedings of the 2012 international conference on Compilers, architectures and synthesis for embedded systems

Quantified Score

Hi-index	0.01

Visualization

Abstract

Clustered organizations are becoming a common trend in the design of VLIW architectures. In this work, we propose a novel modulo scheduling approach for such architectures. The proposed technique performs the cluster assignment and the instruction scheduling in a single pass, which is shown to be more effective than doing first the assignment and later the scheduling. We also show that loop unrolling significantly enhances the performance of the proposed scheduler, especially when the communication channel among clusters is the main performance bottleneck. By selectively unrolling some loops, we can obtain the best performance with the minimum increase in code size. Performance evaluation for the SPECfp95 shows that the clustered architecture achieves about the same IPC (Instructions Per Cycle) as a unified architecture with the same resources. Moreover, when the cycle time is taken into account, a 4-cluster configuration is 3.6 times faster than the unified architecture.