The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared Memory Multiprocessor

Authors:
D. J. Lilja
Affiliations:
-
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
1994


Abstract

Trace-driven simulations of numerical Fortran programs are used to study the impact of the parallel loop scheduling strategy on data prefetching in a shared memory multiprocessor with private data caches. The simulations indicate that to maximize memory performance, it is important to schedule blocks of consecutive iterations to execute on each processor, and then to adaptively prefetch single-word cache blocks to match the number of iterations scheduled. Prefetching multiple single-word cache blocks on a miss reduces the miss ratio by approximately 5% to 30% compared to a system with no prefetching. In addition, the proposed adaptive prefetching scheme further reduces the miss ratio while significantly reducing the false sharing among cache blocks compared to nonadaptive prefetching strategies. Reducing the false sharing causes fewer coherence invalidations to be generated, and thereby reduces the total network traffic. The impact of the prefetching and scheduling strategies on the temporal distribution of coherence invalidations also is examined. It is found that invalidations tend to be evenly distributed throughout the execution of parallel loops, but tend to be clustered when executing sequential program sections. The distribution of invalidations in both types of program sections is relatively insensitive to the prefetching and scheduling strategy.
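
The interaction between block scheduling and adaptive prefetching described in the abstract can be illustrated with a small toy model. The sketch below is not the paper's simulator or its exact prefetching scheme, since the abstract does not specify them. The function names (block_schedule, count_misses), the cold unbounded cache, and the access pattern (iteration i reads the single word a[i]) are all assumptions made for illustration. The "overshoot" count is only a rough proxy for words fetched outside a processor's own chunk, which in a real system could lead to extra coherence invalidations.

```python
def block_schedule(num_iterations, num_processors):
    """Split iterations 0..num_iterations-1 into contiguous chunks,
    one chunk per processor (hypothetical static block scheduling)."""
    base, extra = divmod(num_iterations, num_processors)
    chunks = []
    start = 0
    for p in range(num_processors):
        size = base + (1 if p < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def count_misses(chunk, prefetch_degree=None):
    """Count misses for one processor that reads word i in iteration i.

    Toy assumptions: single-word cache blocks, cold cache, no capacity limit.
    prefetch_degree=k fetches k extra consecutive words on each miss.
    prefetch_degree=None is the adaptive case: on a miss, fetch exactly the
    words remaining in this processor's chunk and nothing beyond it.
    Returns (misses, overshoot), where overshoot counts words fetched
    past the end of the chunk.
    """
    if len(chunk) == 0:
        return 0, 0
    cached = set()
    misses = 0
    overshoot = 0
    last = chunk[-1]
    for i in chunk:
        if i in cached:
            continue  # hit: the word was brought in by an earlier prefetch
        misses += 1
        degree = (last - i) if prefetch_degree is None else prefetch_degree
        for j in range(i, i + degree + 1):
            cached.add(j)
            if j > last:
                overshoot += 1  # word belongs to another processor's chunk
    return misses, overshoot


if __name__ == "__main__":
    for chunk in block_schedule(10, 4):
        print(list(chunk),
              "none:", count_misses(chunk, 0),
              "fixed(1):", count_misses(chunk, 1),
              "adaptive:", count_misses(chunk))
```

In this model a fixed prefetch degree either leaves misses inside the chunk or fetches words past its end, while matching the degree to the chunk size avoids both. This matches the qualitative direction reported in the abstract, but the toy model says nothing about the magnitude of the effect.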