Cache sensitive modulo scheduling

Authors:
F. Jesús Sánchez;Antonio González
Affiliations:
Department of Computer Architecture, Universitat Politècnica de Catalunya, Campus Nord - c./ Jordi Girona, 1-3 - Mòdul D6, 08034 - Barcelona, Spain;Department of Computer Architecture, Universitat Politècnica de Catalunya, Campus Nord - c./ Jordi Girona, l-3 - Mòdul D6, 08034 - Barcelona, Spain
Venue:
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Year:
1997

Citing 19
Cited 14

Software pipelining: an effective scheduling technique for VLIW machines

PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
Software prefetching

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Circular scheduling: a new technique to perform software pipelining

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
An architecture for software-controlled data prefetching

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Design and evaluation of a compiler algorithm for prefetching

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Lifetime-sensitive modulo scheduling

PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Balanced scheduling: instruction scheduling when memory latency is uncertain

PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Evolution of the PowerPC Architecture

IEEE Micro
Iterative modulo scheduling: an algorithm for software pipelining loops

MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Minimizing register requirements under resource-constrained rate-optimal software pipelining

MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Optimum modulo schedules for minimum register requirements

ICS '95 Proceedings of the 9th international conference on Supercomputing
Compiler techniques for data prefetching on the PowerPC

PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
Stage scheduling: a technique to reduce the register requirements of a modulo schedule

Proceedings of the 28th annual international symposium on Microarchitecture
Hypernode reduction modulo scheduling

Proceedings of the 28th annual international symposium on Microarchitecture
Predictability of load/store instruction latencies

MICRO 26 Proceedings of the 26th annual international symposium on Microarchitecture
Decomposed Software Pipelining: A New Approach to Exploit Instruction Level Parallelism for Loop Programs

PACT '93 Proceedings of the IFIP WG10.3. Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism
Static Locality Analysis for Cache Management

PACT '97 Proceedings of the 1997 International Conference on Parallel Architectures and Compilation Techniques
Swing Modulo Scheduling: A Lifetime-Sensitive Approach

PACT '96 Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques

Modulo scheduling for a fully-distributed clustered VLIW architecture

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Two-level hierarchical register file organization for VLIW processors

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Instruction scheduling for clustered VLIW architectures

ISSS '00 Proceedings of the 13th international symposium on System synthesis
An interleaved cache clustered VLIW processor

ICS '02 Proceedings of the 16th international conference on Supercomputing
Graph-partitioning based instruction scheduling for clustered processors

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Modulo scheduling with integrated register spilling for clustered VLIW architectures

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Exploiting Pseudo-Schedules to Guide Data Dependence Graph Partitioning

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Effective instruction scheduling techniques for an interleaved cache clustered VLIW processor

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Local scheduling techniques for memory coherence in a clustered VLIW processor with a distributed data cache

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
The Effectiveness of Loop Unrolling for Modulo Scheduling in Clustered VLIW Architectures

ICPP '00 Proceedings of the Proceedings of the 2000 International Conference on Parallel Processing
Distributed Data Cache Designs for Clustered VLIW Processors

IEEE Transactions on Computers
Software and hardware techniques to optimize register file utilization in VLIW architectures

International Journal of Parallel Programming
Latency-tolerant software pipelining in a production compiler

Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
An algorithm to improve parallelism in distributed systems using asynchronous calls

PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper focuses on the interaction between software prefetching (both binding and nonbinding) and software pipelining for VLIW machines. First, it is shown that evaluating software pipelined schedules without considering memory effects can be rather inaccurate due to stalls caused by dependences with memory instructions (even if a lockup-free cache is considered). It is also shown that the penalty of the stalls is in general higher than the effect of spill code. Second, we show that in general binding schemes are more powerful than nonbinding ones for software pipelined schedules. Finally, the main contribution of this paper is an heuristic scheme that schedules some memory operations according to the locality estimated at compile time and other attributes of the dependence graph. The proposed scheme is shown to outperform other heuristic approaches since it achieves a better trade-off between compute and stall time than the others.