Loop Scheduling with Complete Memory Latency Hiding on Multi-core Architecture

Authors:
Chun Xue; Zili Shao; Meilin Liu; Meikang Qiu; Edwin H.-M. Sha
Affiliations:
University of Texas at Dallas, USA; Hong Kong Polytechnic University, Hong Kong; University of Texas at Dallas, USA; University of Texas at Dallas, USA; University of Texas at Dallas, USA
Venue:
ICPADS '06 Proceedings of the 12th International Conference on Parallel and Distributed Systems - Volume 1
Year:
2006

Citing (13)

A performance study of software and hardware data prefetching schemes. ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Cache miss heuristics and preloading techniques for general-purpose programs. Proceedings of the 28th annual international symposium on Microarchitecture
Scheduling of uniform multidimensional systems under resource constraints. IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Optimizing Overall Loop Schedules Using Prefetching and Partitioning. IEEE Transactions on Parallel and Distributed Systems
Scheduling and partitioning for multiple loop nests. Proceedings of the 14th international symposium on Systems synthesis
Sequential Hardware Prefetching in Shared-Memory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems
An adaptive sequential prefetching scheme in shared-memory multiprocessors. ICPP '97 Proceedings of the international Conference on Parallel Processing
Loop Scheduling and Partitions for Hiding Memory Latencies. Proceedings of the 12th international symposium on System synthesis
Guided region prefetching: a cooperative hardware/software approach. Proceedings of the 30th annual international symposium on Computer architecture
Iterational retiming: maximize iteration-level parallelism for nested loops. CODES+ISSS '05 Proceedings of the 3rd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Introduction to the cell multiprocessor. IBM Journal of Research and Development - POWER5 and packaging
Partitioning and scheduling DSP applications with maximal memory access hiding. EURASIP Journal on Applied Signal Processing
Rotation scheduling: a loop pipelining algorithm. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Cited by (3)

A low-complexity microprocessor design with speculative pre-execution. Journal of Systems Architecture: the EUROMICRO Journal
Variable Partitioning and Scheduling for MPSoC with Virtually Shared Scratch Pad Memory. Journal of Signal Processing Systems
Algorithms for optimally arranging multicore memory structures. EURASIP Journal on Embedded Systems

Abstract

The widening gap between processor and memory performance is the main obstacle to achieving high processor utilization in modern computer systems. In this paper, we propose a new loop scheduling and memory management technique, Iterational Retiming with Partitioning (IRP), that can completely hide memory latencies for applications with multi-dimensional loops on architectures such as the CELL processor [1]. In IRP, the iteration space is first carefully partitioned. Then a two-part schedule, consisting of a processor part and a memory part, is produced such that the execution time of the memory part never exceeds the execution time of the processor part. The two parts execute simultaneously, so memory latency is completely hidden. Experiments on DSP benchmarks show that IRP consistently produces optimal solutions and significantly improves on previous techniques.
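
The hiding condition stated in the abstract (memory part no longer than processor part) can be illustrated with a small Python sketch. This is not the IRP algorithm from the paper; it is a toy cost model under stated assumptions: partitions are `fx` by `fy` rectangles of a two-dimensional iteration space, every operation takes one cycle, work divides evenly across cores, and the data to prefetch for a partition grows with its boundary (`fx + fy` loads). All function and parameter names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def processor_time(fx, fy, ops_per_iter, cores):
    """Cycles for the processor part of one fx-by-fy partition.

    Assumption: unit-time operations spread evenly over the cores.
    """
    return ceil_div(fx * fy * ops_per_iter, cores)


def memory_time(fx, fy, prefetch_latency, memory_units):
    """Cycles for the memory part of one partition.

    Assumption: one prefetch per boundary iteration along two edges,
    issued in batches of `memory_units`, each batch costing one latency.
    """
    boundary_loads = fx + fy
    return ceil_div(boundary_loads, memory_units) * prefetch_latency


def hides_latency(fx, fy, *, ops_per_iter, cores, prefetch_latency, memory_units):
    """True if the memory part never exceeds the processor part."""
    mem = memory_time(fx, fy, prefetch_latency, memory_units)
    proc = processor_time(fx, fy, ops_per_iter, cores)
    return mem <= proc


def smallest_square_partition(max_side, **machine):
    """Smallest square partition side that hides latency, or None."""
    for side in range(1, max_side + 1):
        if hides_latency(side, side, **machine):
            return side
    return None


if __name__ == "__main__":
    # Hypothetical machine parameters chosen only for illustration.
    machine = dict(ops_per_iter=2, cores=4, prefetch_latency=10, memory_units=2)
    print(smallest_square_partition(64, **machine))
```

In this model computation grows with the partition's area while prefetching grows with its perimeter, so a sufficiently large partition always satisfies the condition. That is one intuition for why the choice of partition size matters, though the paper's actual partitioning and retiming steps are more involved than this sketch.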