Optimizing Overall Loop Schedules Using Prefetching and Partitioning

Authors:
Fei Chen; Timothy W. O'Neil; Edwin H.-M. Sha
Affiliations:
Univ. of Notre Dame, Notre Dame, IN (all authors)
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2000

Abstract

This paper proposes Partition Scheduling with Prefetching (PSP), a method that combines loop pipelining with data prefetching. PSP first divides the iteration space into regular partitions. It then builds a two-part schedule, consisting of an ALU part and a memory part, and balances the two parts to achieve high throughput. Because the two parts execute simultaneously, remote memory latencies are overlapped. We study the optimal partition shape and size so that a well-balanced overall schedule can be obtained. Experiments on DSP benchmarks show that the proposed methodology consistently produces optimal or near-optimal solutions.
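The balancing idea can be illustrated with a toy cost model. The sketch below is not the PSP algorithm from the paper; it only shows why partition shape and size matter when an ALU part and a memory part run side by side. Everything in it is an assumption made for illustration: the function names, the rectangular partitions, the per-iteration ALU cost, the rule that prefetches are needed only along two partition edges, and the cap on iterations per partition.

```python
from itertools import product


def partition_schedule_length(width, height, alu_cycles_per_iter, mem_cycles_per_fetch):
    """Length of one partition's schedule under a toy model (hypothetical).

    The ALU part and the memory part execute simultaneously, so the
    longer of the two determines the overall schedule length.
    """
    alu_part = width * height * alu_cycles_per_iter
    # Assumption: only data along two partition edges must be prefetched.
    mem_part = (width + height) * mem_cycles_per_fetch
    return max(alu_part, mem_part)


def best_partition(max_iters, alu_cycles_per_iter, mem_cycles_per_fetch):
    """Exhaustively search rectangular partitions with at most max_iters iterations.

    Returns (width, height, cycles_per_iteration) for the partition with the
    lowest average cycles per iteration; ties go to the smaller partition.
    """
    best = None
    for width, height in product(range(1, max_iters + 1), repeat=2):
        area = width * height
        if area > max_iters:
            continue  # assumed capacity limit, e.g. local memory size
        length = partition_schedule_length(
            width, height, alu_cycles_per_iter, mem_cycles_per_fetch
        )
        candidate = (length / area, area, width, height)
        if best is None or candidate < best:
            best = candidate
    per_iter, _, width, height = best
    return width, height, per_iter


if __name__ == "__main__":
    # Cheap memory: a 4x4 partition already balances the two parts.
    print(best_partition(64, 1, 2))
    # Expensive memory: the memory part dominates even at the size cap.
    print(best_partition(64, 1, 10))
```

In this model a square partition minimizes the edge-to-area ratio, so the memory part shrinks relative to the ALU part as the partition grows, until either the two parts balance or the assumed capacity limit is reached.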