An Integrated Hardware/Software Data Prefetching Scheme for Shared-Memory Multiprocessors

Authors:
Edward H. Gornish;Alexander Veidenbaum
Affiliations:
-;-
Venue:
International Journal of Parallel Programming
Year:
1999

Citing 29
Cited 6

Compiler-directed cache coherence strategies for large-scale shared-memory multiprocessor systems

Compiler-directed cache coherence strategies for large-scale shared-memory multiprocessor systems
Software prefetching

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Code generation for streaming: an access/execute mechanism

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Tolerating latency through software-controlled prefetching in shared-memory multiprocessors

Journal of Parallel and Distributed Computing - Special issue on shared-memory multiprocessors
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
An architecture for software-controlled data prefetching

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Data prefetching in multiprocessor vector cache memories

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Comparative evaluation of latency reducing and tolerating techniques

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Data access microarchitectures for superscalar processors with compiler-assisted data prefetching

MICRO 24 Proceedings of the 24th annual international symposium on Microarchitecture
An effective on-chip preloading scheme to reduce data access penalty

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Comparison and analysis of software and directory coherence schemes

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Prefetch unit for vector operations on scalar computers

ACM SIGARCH Computer Architecture News
Design and evaluation of a compiler algorithm for prefetching

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
An efficient architecture for loop based data preloading

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Stride directed prefetching in scalar processors

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
An effective write policy for software coherence schemes

Proceedings of the 1992 ACM/IEEE conference on Supercomputing
Limitations of cache prefetching on a bus-based multiprocessor

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Speculative prefetching

ICS '93 Proceedings of the 7th international conference on Supercomputing
Using virtual lines to enhance locality exploitation

ICS '94 Proceedings of the 8th international conference on Supercomputing
A performance study of software and hardware data prefetching schemes

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Access ordering and effective memory bandwidth

Access ordering and effective memory bandwidth
Data prefetching for high-performance processors

Data prefetching for high-performance processors
Compiler-directed data prefetching in multiprocessors with memory hierarchies

ICS '90 Proceedings of the 4th international conference on Supercomputing
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Cache Memories

ACM Computing Surveys (CSUR)
Dependence Analysis for Supercomputing

Dependence Analysis for Supercomputing
The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared Memory Multiprocessor

IEEE Transactions on Parallel and Distributed Systems
Lockup-free instruction fetch/prefetch cache organization

ISCA '81 Proceedings of the 8th annual symposium on Computer Architecture
Principles of Compiler Design (Addison-Wesley series in computer science and information processing)

Principles of Compiler Design (Addison-Wesley series in computer science and information processing)

Boosting data throughput for sequence database similarity searches on FPGAs using an adaptive buffering scheme

Parallel Computing
A compiler-directed data prefetching scheme for chip multiprocessors

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Implementing time-predictable load and store operations

EMSOFT '09 Proceedings of the seventh ACM international conference on Embedded software
Adaptive prefetching for shared cache based chip multiprocessors

Proceedings of the Conference on Design, Automation and Test in Europe
Tackling cache-line stealing effects using run-time adaptation

LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
SAD prefetching for MPEG4 using flux caches

SAMOS'06 Proceedings of the 6th international conference on Embedded Computer Systems: architectures, Modeling, and Simulation

Quantified Score

Hi-index	0.00

Visualization

Abstract

Both hardware and software prefetching have been shown to be effective in tolerating the large memory latencies inherent in shared-memory multiprocessors; however, both types of prefetching have their shortcomings. While software schemes require less hardware support than hardware schemes, they must generate address calculation instructions and a prefetch instruction for each datum that needs to be prefetched. Hardware schemes, however, must become progressively more complex to be able to compute data access strides and to increase the prefetching lookahead. In this paper, we propose an integrated hardware/software prefetching method that uses simple hardware that can handle most data accesses and software prefetching for the few remaining accesses. A compile time algorithm analyzes the access streams formed by array references and determines sequences of consecutive memory accesses to an access stream that can be prefetched by the hardware mechanism. This analysis is based on the relative memory locations of consecutive accesses to an access stream and the number of intervening data references between consecutive accesses to an access stream. In addition, the prefetching lookahead can be set separately for each access stream. Our approach yields an effective scheme that minimizes both CPU overhead and hardware costs. Execution-driven simulations show our method to be very effective.