The effect of sharing on the cache and bus performance of parallel programs
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Memory Access Dependencies in Shared-Memory Multiprocessors
IEEE Transactions on Software Engineering
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Performance evaluation of memory consistency models for shared-memory multiprocessors
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Tolerating latency through software-controlled prefetching in shared-memory multiprocessors
Journal of Parallel and Distributed Computing - Special issue on shared-memory multiprocessors
Data prefetching in multiprocessor vector cache memories
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
An effective on-chip preloading scheme to reduce data access penalty
Proceedings of the 1991 ACM/IEEE conference on Supercomputing
The Stanford Dash Multiprocessor
Computer
SPLASH: Stanford parallel applications for shared-memory
ACM SIGARCH Computer Architecture News
Cache Invalidation Patterns in Shared-Memory Multiprocessors
IEEE Transactions on Computers
Design and evaluation of a compiler algorithm for prefetching
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
The detection and elimination of useless misses in multiprocessors
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
An adaptive cache coherence protocol optimized for migratory sharing
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Combined performance gains of simple cache protocol extensions
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
A performance study of software and hardware data prefetching schemes
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Weak ordering—a new definition
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Memory consistency and event ordering in scalable shared-memory multiprocessors
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
ACM Computing Surveys (CSUR)
Portable Programs for Parallel Processors
Portable Programs for Parallel Processors
Effectiveness of hardware-based stride and sequential prefetching in shared-memory multiprocessors
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Limits on the performance benefits of multithreading and prefetching
Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Adaptive data prefetching using cache information
ICS '97 Proceedings of the 11th international conference on Supercomputing
Prediction caches for superscalar processors
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Characterization and improvement of load/store cache-based prefetching
ICS '98 Proceedings of the 12th international conference on Supercomputing
CPU Cache Prefetching: Timing Evaluation of Hardware Implementations
IEEE Transactions on Computers
IEEE Transactions on Computers
Hardware-only stream prefetching and dynamic access ordering
Proceedings of the 14th international conference on Supercomputing
Optimal two level partitioning and loop scheduling for hiding memory latency for DSP applications
Proceedings of the 37th Annual Design Automation Conference
Optimizing Overall Loop Schedules Using Prefetching and Partitioning
IEEE Transactions on Parallel and Distributed Systems
Optimal partitioning and balanced scheduling with the maximal overlap of data footprints
GLSVLSI '01 Proceedings of the 11th Great Lakes symposium on VLSI
Dynamic Access Ordering for Streamed Computations
IEEE Transactions on Computers
Minimizing Average Schedule Length under Memory Constraints by Optimal Partitioning and Prefetching
Journal of VLSI Signal Processing Systems
Designing a Modern Memory Hierarchy with Hardware Prefetching
IEEE Transactions on Computers
Hybrid compiler/hardware prefetching for multiprocessors using low-overhead cache miss traps
ICPP '97 Proceedings of the international Conference on Parallel Processing
A Hardware Scheme for Data Prefetching
HPCN Europe 2000 Proceedings of the 8th International Conference on High-Performance Computing and Networking
Experiments with Sequential Prefetching
HPCN Europe 2001 Proceedings of the 9th International Conference on High-Performance Computing and Networking
Sequential Unification and Aggressive Lookahead Mechanisms for Data Memory Accesses
PaCT '999 Proceedings of the 5th International Conference on Parallel Computing Technologies
Effectiveness of hardware-based stride and sequential prefetching in shared-memory multiprocessors
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Loop Scheduling and Partitions for Hiding Memory Latencies
Proceedings of the 12th international symposium on System synthesis
Effective stream-based and execution-based data prefetching
Proceedings of the 18th annual international conference on Supercomputing
On the importance of optimizing the configuration of stream prefetchers
Proceedings of the 2005 workshop on Memory system performance
Loop Scheduling with Complete Memory Latency Hiding on Multi-core Architecture
ICPADS '06 Proceedings of the 12th International Conference on Parallel and Distributed Systems - Volume 1
Partitioning and scheduling DSP applications with maximal memory access hiding
EURASIP Journal on Applied Signal Processing
A case for low-complexity MP architectures
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Data access history cache and associated data prefetching mechanisms
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Server-based data push architecture for multi-processor environments
Journal of Computer Science and Technology
Effective loop partitioning and scheduling under memory and register dual constraints
Proceedings of the conference on Design, automation and test in Europe
Off-loading application controlled data prefetching in numerical codes for multi-core processors
International Journal of Computational Science and Engineering
Iterational retiming with partitioning: Loop scheduling with complete memory latency hiding
ACM Transactions on Embedded Computing Systems (TECS)
An Adaptive Data Prefetcher for High-Performance Processors
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Many-Thread Aware Prefetching Mechanisms for GPGPU Applications
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Bandwidth constrained coordinated HW/SW prefetching for multicores
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
Using runtime activity to dynamically filter out inefficient data prefetches
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
Global-aware and multi-order context-based prefetching for high-performance processors
International Journal of High Performance Computing Applications
Optimizing explicit data transfers for data parallel applications on the cell architecture
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Reducing Power and Energy Overhead in Instruction Prefetching for Embedded Processor Systems
International Journal of Handheld Computing Research
Removal of Conflicts in Hardware Transactional Memory Systems
International Journal of Parallel Programming
Hi-index | 0.01 |
To offset the effect of read miss penalties on processor utilization in shared-memory multiprocessors, several software- and hardware-based data prefetching schemes have been proposed. A major advantage of hardware techniques is that they need no support from the programmer or compiler.Sequential prefetching is a simple hardware-controlled prefetching technique which relies on the automatic prefetch of consecutive blocks following the block that misses in the cache, thus exploiting spatial locality. In its simplest form, the number of prefetched blocks on each miss is fixed throughout the execution. However, since the prefetching efficiency varies during the execution of a program, we propose to adapt the number of prefetched blocks according to a dynamic measure of prefetching effectiveness. Simulations of this adaptive scheme show reductions of the number of read misses, the read penalty, and of the execution time by up to 78%, 58%, and 25% respectively.