Improving locality and parallelism in nested loops
The performance of programs consisting of parallel loops on shared-memory multiprocessors is increasingly limited by long memory latencies, because processor speeds improve more rapidly than memory speeds. Two complementary techniques address this gap: (a) cache locality enhancement, which reduces latency, and (b) data prefetching, which tolerates it. This paper studies the benefit of combining loop fusion for locality enhancement with software prefetching. Experimental results are reported for multiprocessors with support for prefetching. For a complete application on an SGI Power Challenge R10000 system, combining loop fusion with prefetching improves parallel speedup by 46%.