The widening gap between processor and memory performance is the main obstacle to achieving high processor utilization in modern computer systems. To hide memory latency, a variety of techniques have been proposed, ranging from intermediate fast memories (caches) to various prefetching and memory management techniques. In this article, we propose a new technique that combines loop scheduling with memory management, Iterational Retiming with Partitioning (IRP), which can completely hide memory latencies for applications with multidimensional loops on architectures such as the Cell processor. In IRP, the iteration space is first carefully partitioned. Then a two-part schedule, consisting of a processor part and a memory part, is produced such that the execution time of the memory part never exceeds the execution time of the processor part. The two parts execute simultaneously, so memory latency is hidden completely. We prove that such an optimal two-part schedule can always be achieved given the right partition size and shape. Experiments on DSP benchmarks show that IRP consistently produces optimal solutions and improves significantly on previous techniques.
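The latency-hiding condition stated above (the memory part of the schedule must not take longer than the processor part) can be illustrated with a toy cost model. The sketch below is not the IRP algorithm: the names `Partition`, `Machine`, `processor_time`, `memory_time`, and `smallest_hiding_square` are hypothetical, and the assumption that a partition needs one prefetch per boundary iteration (`width + height`) is made up purely for illustration. It only shows why growing a partition can make the condition hold: compute work scales with the partition's area, while the assumed prefetch work scales with its boundary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    width: int   # iterations along the inner loop dimension
    height: int  # iterations along the outer loop dimension


@dataclass(frozen=True)
class Machine:
    # All parameters are illustrative assumptions, not values from the article.
    num_processors: int = 4
    memory_units: int = 2
    cycles_per_iteration: int = 2
    prefetch_latency: int = 10


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return (a + b - 1) // b


def processor_time(p: Partition, m: Machine) -> int:
    """Cycles needed to compute every iteration in the partition."""
    work = p.width * p.height * m.cycles_per_iteration
    return ceil_div(work, m.num_processors)


def memory_time(p: Partition, m: Machine) -> int:
    """Cycles needed to prefetch data for the partition.

    Toy assumption: one prefetch per boundary iteration (width + height).
    """
    prefetches = p.width + p.height
    return ceil_div(prefetches * m.prefetch_latency, m.memory_units)


def is_latency_hidden(p: Partition, m: Machine) -> bool:
    """True when the memory part never outlasts the processor part."""
    return memory_time(p, m) <= processor_time(p, m)


def smallest_hiding_square(m: Machine, limit: int = 1024):
    """Smallest square partition side that hides latency, or None."""
    for side in range(1, limit + 1):
        if is_latency_hidden(Partition(side, side), m):
            return side
    return None
```

Under these assumed parameters, `smallest_hiding_square(Machine())` returns `20`: a 20 × 20 partition takes 200 cycles in both parts, while a 19 × 19 partition needs 190 memory cycles against only 181 processor cycles.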