A performance study of software and hardware data prefetching schemes
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Data relocation and prefetching for programs with large data sets
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Resource-constrained loop list scheduler for DSP algorithms
Journal of VLSI Signal Processing Systems - Special issue on VLSI design methodologies for digital signal processing systems
Cache miss heuristics and preloading techniques for general-purpose programs
Proceedings of the 28th annual international symposium on Microarchitecture
Thread scheduling for cache locality
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Modulo scheduling of loops in control-intensive non-numeric programs
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Tango: a hardware-based data prefetching technique for superscalar processors
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Combining loop transformations considering caches and scheduling
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Achieving Full Parallelism Using Multidimensional Retiming
IEEE Transactions on Parallel and Distributed Systems
Data prefetching for software DSMs
ICS '98 Proceedings of the 12th international conference on Supercomputing
Modeled and Measured Instruction Fetching Performance for Superscalar Microprocessors
IEEE Transactions on Parallel and Distributed Systems
Scheduling of uniform multidimensional systems under resource constraints
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Sequential Hardware Prefetching in Shared-Memory Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
Hybrid compiler/hardware prefetching for multiprocessors using low-overhead cache miss traps
ICPP '97 Proceedings of the international Conference on Parallel Processing
An adaptive sequential prefetching scheme in shared-memory multiprocessors
ICPP '97 Proceedings of the international Conference on Parallel Processing
Combining Loop Fusion with Prefetching on Shared-memory Multiprocessors
ICPP '97 Proceedings of the international Conference on Parallel Processing
Optimal code size reduction for software-pipelined and unfolded loops
Proceedings of the 15th international symposium on System Synthesis
Reducing Cache Conflicts by Multi-Level Cache Partitioning and Array Elements Mapping
The Journal of Supercomputing
Code size reduction technique and implementation for software-pipelined DSP applications
ACM Transactions on Embedded Computing Systems (TECS)
Data dependent loop scheduling based on genetic algorithms for distributed and shared memory systems
Journal of Parallel and Distributed Computing
Loop Scheduling with Complete Memory Latency Hiding on Multi-core Architecture
ICPADS '06 Proceedings of the 12th International Conference on Parallel and Distributed Systems - Volume 1
Energy saving for memory with loop scheduling and prefetching
Proceedings of the 18th ACM Great Lakes symposium on VLSI
Effective loop partitioning and scheduling under memory and register dual constraints
Proceedings of the conference on Design, automation and test in Europe
Iterational retiming with partitioning: Loop scheduling with complete memory latency hiding
ACM Transactions on Embedded Computing Systems (TECS)
Variable Partitioning and Scheduling for MPSoC with Virtually Shared Scratch Pad Memory
Journal of Signal Processing Systems
Loop Distribution and Fusion with Timing and Code Size Optimization
Journal of Signal Processing Systems
Hi-index | 0.00 |
In this paper, a method combining the loop pipelining technique with data prefetching, called Partition Scheduling with Prefetching (PSP), is proposed. In PSP, the iteration space is first divided into regular partitions. Then a two-part schedule, consisting of the ALU and memory parts, is produced and balanced to produce high throughput. These two parts are executed simultaneously, and hence, the remote memory latencies are overlapped. We study the optimal partition shape and size so that a well-balanced overall schedule can be obtained. Experiments on DSP benchmarks show that the proposed methodology consistently produces optimal or near optimal solutions.