Compiler-directed cache coherence strategies for large-scale shared-memory multiprocessor systems
Compiler-directed cache coherence strategies for large-scale shared-memory multiprocessor systems
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Code generation for streaming: an access/execute mechanism
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Tolerating latency through software-controlled prefetching in shared-memory multiprocessors
Journal of Parallel and Distributed Computing - Special issue on shared-memory multiprocessors
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
An architecture for software-controlled data prefetching
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Data prefetching in multiprocessor vector cache memories
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Comparative evaluation of latency reducing and tolerating techniques
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Data access microarchitectures for superscalar processors with compiler-assisted data prefetching
MICRO 24 Proceedings of the 24th annual international symposium on Microarchitecture
An effective on-chip preloading scheme to reduce data access penalty
Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Comparison and analysis of software and directory coherence schemes
Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Prefetch unit for vector operations on scalar computers
ACM SIGARCH Computer Architecture News
Design and evaluation of a compiler algorithm for prefetching
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
An efficient architecture for loop based data preloading
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Stride directed prefetching in scalar processors
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
An effective write policy for software coherence schemes
Proceedings of the 1992 ACM/IEEE conference on Supercomputing
Limitations of cache prefetching on a bus-based multiprocessor
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
ICS '93 Proceedings of the 7th international conference on Supercomputing
Using virtual lines to enhance locality exploitation
ICS '94 Proceedings of the 8th international conference on Supercomputing
A performance study of software and hardware data prefetching schemes
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Access ordering and effective memory bandwidth
Access ordering and effective memory bandwidth
Data prefetching for high-performance processors
Data prefetching for high-performance processors
Compiler-directed data prefetching in multiprocessors with memory hierarchies
ICS '90 Proceedings of the 4th international conference on Supercomputing
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
ACM Computing Surveys (CSUR)
Dependence Analysis for Supercomputing
Dependence Analysis for Supercomputing
The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared Memory Multiprocessor
IEEE Transactions on Parallel and Distributed Systems
Lockup-free instruction fetch/prefetch cache organization
ISCA '81 Proceedings of the 8th annual symposium on Computer Architecture
Principles of Compiler Design (Addison-Wesley series in computer science and information processing)
Principles of Compiler Design (Addison-Wesley series in computer science and information processing)
A compiler-directed data prefetching scheme for chip multiprocessors
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Implementing time-predictable load and store operations
EMSOFT '09 Proceedings of the seventh ACM international conference on Embedded software
Adaptive prefetching for shared cache based chip multiprocessors
Proceedings of the Conference on Design, Automation and Test in Europe
Tackling cache-line stealing effects using run-time adaptation
LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
SAD prefetching for MPEG4 using flux caches
SAMOS'06 Proceedings of the 6th international conference on Embedded Computer Systems: architectures, Modeling, and Simulation
Hi-index | 0.00 |
Both hardware and software prefetching have been shown to be effective in tolerating the large memory latencies inherent in shared-memory multiprocessors; however, both types of prefetching have their shortcomings. While software schemes require less hardware support than hardware schemes, they must generate address calculation instructions and a prefetch instruction for each datum that needs to be prefetched. Hardware schemes, however, must become progressively more complex to be able to compute data access strides and to increase the prefetching lookahead. In this paper, we propose an integrated hardware/software prefetching method that uses simple hardware that can handle most data accesses and software prefetching for the few remaining accesses. A compile time algorithm analyzes the access streams formed by array references and determines sequences of consecutive memory accesses to an access stream that can be prefetched by the hardware mechanism. This analysis is based on the relative memory locations of consecutive accesses to an access stream and the number of intervening data references between consecutive accesses to an access stream. In addition, the prefetching lookahead can be set separately for each access stream. Our approach yields an effective scheme that minimizes both CPU overhead and hardware costs. Execution-driven simulations show our method to be very effective.