A performance study of software and hardware data prefetching schemes

Authors:
T.-F. Chen;J.-L. Baer
Affiliations:
Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi, Taiwan, R.O.C;Department of Computer Science and Engineering, University of Washington, Seattle, WA
Venue:
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Year:
1994

Citing 16
Cited 69

Memory access buffering in multiprocessors

ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
Tolerating latency through software-controlled prefetching in shared-memory multiprocessors

Journal of Parallel and Distributed Computing - Special issue on shared-memory multiprocessors
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
An architecture for software-controlled data prefetching

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Data prefetching in multiprocessor vector cache memories

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Data access microarchitectures for superscalar processors with compiler-assisted data prefetching

MICRO 24 Proceedings of the 24th annual international symposium on Microarchitecture
An effective on-chip preloading scheme to reduce data access penalty

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
SPLASH: Stanford parallel applications for shared-memory

ACM SIGARCH Computer Architecture News
Design and evaluation of a compiler algorithm for prefetching

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Stride directed prefetching in scalar processors

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Limitations of cache prefetching on a bus-based multiprocessor

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Data prefetching for high-performance processors

Data prefetching for high-performance processors
Compiler-directed data prefetching in multiprocessors with memory hierarchies

ICS '90 Proceedings of the 4th international conference on Supercomputing
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Cache Memories

ACM Computing Surveys (CSUR)
Lockup-free instruction fetch/prefetch cache organization

ISCA '81 Proceedings of the 8th annual symposium on Computer Architecture

Effective cache prefetching on bus-based multiprocessors

ACM Transactions on Computer Systems (TOCS)
Speeding up irregular applications in shared-memory multiprocessors: memory binding and group prefetching

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
A data cache with multiple caching strategies tuned to different types of locality

ICS '95 Proceedings of the 9th international conference on Supercomputing
Compiler techniques for data prefetching on the PowerPC

PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
SPAID: software prefetching in pointer- and call-intensive environments

Proceedings of the 28th annual international symposium on Microarchitecture
Evaluation of Hardware-Based Stride and Sequential Prefetching in Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Memory bandwidth limitations of future microprocessors

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Limits on the performance benefits of multithreading and prefetching

Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Value locality and load value prediction

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Exceeding the dataflow limit via value prediction

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
CP-PACS: a massively parallel processor for large scale scientific calculations

ICS '97 Proceedings of the 11th international conference on Supercomputing
Speculative execution via address prediction and data prefetching

ICS '97 Proceedings of the 11th international conference on Supercomputing
Comparing data forwarding and prefetching for communication-induced misses in shared-memory MPs

ICS '98 Proceedings of the 12th international conference on Supercomputing
Speculative multithreaded processors

ICS '98 Proceedings of the 12th international conference on Supercomputing
Characterization and improvement of load/store cache-based prefetching

ICS '98 Proceedings of the 12th international conference on Supercomputing
A study of three dynamic approaches to handle widely shared data in shared-memory multiprocessors

ICS '98 Proceedings of the 12th international conference on Supercomputing
Dependence based prefetching for linked data structures

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Effects of Multithreading on Cache Performance

IEEE Transactions on Computers - Special issue on cache memory and related problems
An Integrated Hardware/Software Data Prefetching Scheme for Shared-Memory Multiprocessors

International Journal of Parallel Programming
Optimizing Overall Loop Schedules Using Prefetching and Partitioning

IEEE Transactions on Parallel and Distributed Systems
Data prefetch mechanisms

ACM Computing Surveys (CSUR)
Access pattern based local memory customization for low power embedded systems

Proceedings of the conference on Design, automation and test in Europe
Minimizing Average Schedule Length under Memory Constraints by Optimal Partitioning and Prefetching

Journal of VLSI Signal Processing Systems
Exploring multimedia applications locality to improve cache performance

MULTIMEDIA '00 Proceedings of the eighth ACM international conference on Multimedia
Avoiding initialization misses to the heap

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
A Simulation Study of Decoupled Vector Architectures

The Journal of Supercomputing
Exploiting Value Locality to Exceed the Dataflow Limit

International Journal of Parallel Programming
When Caches Aren't Enough: Data Prefetching Techniques

Computer
Boosting the Performance of Shared Memory Multiprocessors

Computer
Data Cache and Direct Memory Access in Programming Mediaprocessors

IEEE Micro
Sequential Hardware Prefetching in Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Analysis of performance bottlenecks in multithreaded multiprocessor systems

Fundamenta Informaticae - Application of concurrency to system design
Hybrid compiler/hardware prefetching for multiprocessors using low-overhead cache miss traps

ICPP '97 Proceedings of the international Conference on Parallel Processing
Exploiting the Prefetching Effect Provided by Executing Mispredicted Load Instructions

Euro-Par '02 Proceedings of the 8th International Euro-Par Conference on Parallel Processing
Content-Based Prefetching: Initial Results

IMS '00 Revised Papers from the Second International Workshop on Intelligent Memory Systems
Overlap of computation and communication on shared-memory networks-of-workstations

Cluster computing
Effectiveness of hardware-based stride and sequential prefetching in shared-memory multiprocessors

HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Decoupled vector architectures

HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
Distributed Prefetch-buffer/Cache Design for High Performance Memory Systems

HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
The Architecture of Massively Parallel Processor CP-PACS

PAS '97 Proceedings of the 2nd AIZU International Symposium on Parallel Algorithms / Architecture Synthesis
DRAM-Page Based Prediction and Prefetching

ICCD '00 Proceedings of the 2000 IEEE International Conference on Computer Design: VLSI in Computers & Processors
A Prefetch Taxonomy

IEEE Transactions on Computers
Managing Wire Delay in Large Chip-Multiprocessor Caches

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
When prefetching improves/degrades performance

Proceedings of the 2nd conference on Computing frontiers
Exploring the limits of prefetching

IBM Journal of Research and Development - Electrochemical technology in microelectronics
Memory Performance Optimizations For Real-Time Software HDTV Decoding

Journal of VLSI Signal Processing Systems
Energy-efficient and high-performance instruction fetch using a block-aware ISA

ISLPED '05 Proceedings of the 2005 international symposium on Low power electronics and design
Memory access pattern analysis and stream cache design for multimedia applications

ASP-DAC '03 Proceedings of the 2003 Asia and South Pacific Design Automation Conference
Loop Scheduling with Complete Memory Latency Hiding on Multi-core Architecture

ICPADS '06 Proceedings of the 12th International Conference on Parallel and Distributed Systems - Volume 1
Block-aware instruction set architecture

ACM Transactions on Architecture and Code Optimization (TACO)
3DString: a feature string kernel for 3D object classification on voxelized data

CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Semantic fitting and reconstruction

Journal on Computing and Cultural Heritage (JOCCH)
A compiler-directed data prefetching scheme for chip multiprocessors

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Architecture of the internet archive

SYSTOR '09 Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference
Iterational retiming with partitioning: Loop scheduling with complete memory latency hiding

ACM Transactions on Embedded Computing Systems (TECS)
Variable Partitioning and Scheduling for MPSoC with Virtually Shared Scratch Pad Memory

Journal of Signal Processing Systems
MADE: a composite visual-based 3D shape descriptor

MIRAGE'07 Proceedings of the 3rd international conference on Computer vision/computer graphics collaboration techniques
SAMS multi-layout memory: providing multiple views of data to boost SIMD performance

Proceedings of the 24th ACM International Conference on Supercomputing
Adaptive prefetching for shared cache based chip multiprocessors

Proceedings of the Conference on Design, Automation and Test in Europe
Performance limitations of block-multithreaded distributed-memory systems

Winter Simulation Conference
Optimizing explicit data transfers for data parallel applications on the cell architecture

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
When Prefetching Works, When It Doesn’t, and Why

ACM Transactions on Architecture and Code Optimization (TACO)
Improving instruction delivery with a block-aware ISA

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
A memory bandwidth effective cache store miss policy

ACSAC'05 Proceedings of the 10th Asia-Pacific conference on Advances in Computer Systems Architecture
Application-Specific hardware-driven prefetching to improve data cache performance

ACSAC'05 Proceedings of the 10th Asia-Pacific conference on Advances in Computer Systems Architecture
Targeted data prefetching

ACSAC'05 Proceedings of the 10th Asia-Pacific conference on Advances in Computer Systems Architecture
Analysis of Performance Bottlenecks in Multithreaded Multiprocessor Systems

Fundamenta Informaticae - Application of Concurrency to System Design
Diagnosis and optimization of application prefetching performance

Proceedings of the 27th international ACM conference on International conference on supercomputing
Meeting midway: improving CMP performance with memory-side prefetching

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques

Quantified Score

Hi-index	0.01

Visualization

Abstract

Prefetching, i.e., exploiting the overlap of processor computations with data accesses, is one of several approaches for tolerating memory latencies. Prefetching can be either hardware-based or software-directed or a combination of both. Hardware-based prefetching, requiring some support unit connected to the cache, can dynamically handle prefetches at run-time without compiler intervention. Software-directed approaches rely on compiler technology to insert explicit prefetch instructions. Mowry et al.'s software scheme [13, 14] and our hardware approach [1] are two representative schemes.In this paper, we evaluate approximations to these two schemes in the context of a shared-memory multiprocessor environment. Our qualitative comparisons indicate that both schemes are able to reduce cache misses in the domain of linear array references. When complex data access patterns are considered, the software approach has compile-time information to perform sophisticated prefetching whereas the hardware scheme has the advantage of manipulating dynamic information. The performance results from an instruction-level simulation of four benchmarks confirm these observations. Our simulations show that the hardware scheme introduces more memory traffic into the network and that the software scheme introduces a non-negligible instruction execution overhead. An approach combining software and hardware schemes is proposed; it shows promise in reducing the memory latency with least overhead.