Improving locality and parallelism in nested loops
The performance of programs consisting of parallel loops on shared-memory multiprocessors is increasingly limited by long memory latencies, because processor speeds improve more rapidly than memory speeds. Two complementary techniques address this gap: (a) cache locality enhancement, which reduces latency, and (b) data prefetching, which tolerates it. This paper studies the benefit of combining loop fusion for locality enhancement with software prefetching. Experimental results are reported for multiprocessors with support for prefetching. For a complete application on an SGI Power Challenge R10000 system, combining loop fusion with prefetching improves parallel speedup by 46%.