A performance study of software and hardware data prefetching schemes
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Cache miss heuristics and preloading techniques for general-purpose programs
Proceedings of the 28th annual international symposium on Microarchitecture
Scheduling of uniform multidimensional systems under resource constraints
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Optimizing Overall Loop Schedules Using Prefetching and Partitioning
IEEE Transactions on Parallel and Distributed Systems
Scheduling and partitioning for multiple loop nests
Proceedings of the 14th international symposium on Systems synthesis
Sequential Hardware Prefetching in Shared-Memory Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
An adaptive sequential prefetching scheme in shared-memory multiprocessors
ICPP '97 Proceedings of the international Conference on Parallel Processing
Loop Scheduling and Partitions for Hiding Memory Latencies
Proceedings of the 12th international symposium on System synthesis
Guided region prefetching: a cooperative hardware/software approach
Proceedings of the 30th annual international symposium on Computer architecture
Iterational retiming: maximize iteration-level parallelism for nested loops
CODES+ISSS '05 Proceedings of the 3rd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Introduction to the cell multiprocessor
IBM Journal of Research and Development - POWER5 and packaging
Partitioning and scheduling DSP applications with maximal memory access hiding
EURASIP Journal on Applied Signal Processing
Rotation scheduling: a loop pipelining algorithm
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
A low-complexity microprocessor design with speculative pre-execution
Journal of Systems Architecture: the EUROMICRO Journal
Variable Partitioning and Scheduling for MPSoC with Virtually Shared Scratch Pad Memory
Journal of Signal Processing Systems
Algorithms for optimally arranging multicore memory structures
EURASIP Journal on Embedded Systems
Hi-index | 0.00 |
The widening gap between processor and memory performance is the main bottleneck for modern computer systems to achieve high processor utilization. In this paper, we propose a new loop scheduling with memory management technique, Iterational Retiming with Partitioning (IRP), that can completely hide memory latencies for applications with multi-dimensional loops on architectures like CELL processor [1]. In IRP, the iteration space is first partitioned carefully. Then a two-part schedule, consisting of processor and memory parts, is produced such that the execution time of the memory part never exceeds the execution time of the processor part. These two parts are executed simultaneously and complete memory latency hiding is reached. Experiments on DSP benchmarks show that IRP consistently produces optimal solutions as well as significant improvement over previous techniques.