Evaluating stream buffers as a secondary cache replacement

Authors:
S. Palacharla; R. E. Kessler
Affiliations:
Computer Sciences Department, University of Wisconsin-Madison, Madison, WI; Cray Research, Inc., 900 Lowater Rd., Chippewa Falls, WI
Venue:
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Year:
1994

Citing 11
Cited 97

Cache Operations by MRU Change

IEEE Transactions on Computers
Computer architecture: a quantitative approach

Computer architecture: a quantitative approach
Software prefetching

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Data prefetching in multiprocessor vector cache memories

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
An effective on-chip preloading scheme to reduce data access penalty

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Prefetch unit for vector operations on scalar computers

ACM SIGARCH Computer Architecture News
Design and evaluation of a compiler algorithm for prefetching

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Stride directed prefetching in scalar processors

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Prefetching in supercomputer instruction caches

Proceedings of the 1992 ACM/IEEE conference on Supercomputing
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Cache Memories

ACM Computing Surveys (CSUR)

Instruction fetching: coping with code bloat

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Memory bandwidth limitations of future microprocessors

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Synchronization and communication in the T3E multiprocessor

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Design and evaluation of dynamic access ordering hardware

ICS '96 Proceedings of the 10th international conference on Supercomputing
Examination of a memory access classification scheme for pointer-intensive and numeric programs

ICS '96 Proceedings of the 10th international conference on Supercomputing
Optimizing primary data caches for parallel scientific applications: the pool buffer approach

ICS '96 Proceedings of the 10th international conference on Supercomputing
Tango: a hardware-based data prefetching technique for superscalar processors

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Adaptive page replacement based on memory reference behavior

SIGMETRICS '97 Proceedings of the 1997 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Memory-system design considerations for dynamically-scheduled processors

Proceedings of the 24th annual international symposium on Computer architecture
Prefetching using Markov predictors

Proceedings of the 24th annual international symposium on Computer architecture
Prediction caches for superscalar processors

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Characterization and improvement of load/store cache-based prefetching

ICS '98 Proceedings of the 12th international conference on Supercomputing
Retrospective: improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

25 years of the international symposia on Computer architecture (selected papers)
Capturing dynamic memory reference behavior with adaptive cache topology

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Prefetching Using Markov Predictors

IEEE Transactions on Computers - Special issue on cache memory and related problems
Effects of Multithreading on Cache Performance

IEEE Transactions on Computers - Special issue on cache memory and related problems
A locality sensitive multi-module cache with explicit management

ICS '99 Proceedings of the 13th international conference on Supercomputing
Hardware-only stream prefetching and dynamic access ordering

Proceedings of the 14th international conference on Supercomputing
Push vs. pull: data movement for linked data structures

Proceedings of the 14th international conference on Supercomputing
Smart Memories: a modular reconfigurable architecture

Proceedings of the 27th annual international symposium on Computer architecture
On Interaction between Interconnection Network Design and Latency Hiding Techniques in Multiprocessors

The Journal of Supercomputing
Data prefetch mechanisms

ACM Computing Surveys (CSUR)
Predictor-directed stream buffers

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Access pattern based local memory customization for low power embedded systems

Proceedings of the conference on Design, automation and test in Europe
Hardware prediction for data coherency of scientific codes on DSM

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Dynamic Access Ordering for Streamed Computations

IEEE Transactions on Computers
Fundamental limitations on the use of prefetching and stream buffers for scientific applications

Proceedings of the 2001 ACM symposium on Applied computing
Optimizations Enabled by a Decoupled Front-End Architecture

IEEE Transactions on Computers
Evaluating the impact of memory system performance on software prefetching and locality optimizations

ICS '01 Proceedings of the 15th international conference on Supercomputing
Dead-block prediction & dead-block correlating prefetchers

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
APEX: access pattern based memory architecture exploration

Proceedings of the 14th international symposium on Systems synthesis
The Impulse Memory Controller

IEEE Transactions on Computers
Designing a Modern Memory Hierarchy with Hardware Prefetching

IEEE Transactions on Computers
Performance of the CRAY T3E multiprocessor

SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
Efficient discovery of regular stride patterns in irregular programs and its use in compiler prefetching

PLDI '02 Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation
Using a user-level memory thread for correlation prefetching

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
DSTRIDE: data-cache miss-address-based stride prefetching scheme for multimedia processors

ACSAC '01 Proceedings of the 6th Australasian conference on Computer systems architecture
MIST: an algorithm for memory miss traffic management

Proceedings of the 2000 IEEE/ACM international conference on Computer-aided design
A stateless, content-directed data prefetching mechanism

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Access pattern-based memory and connectivity architecture exploration

ACM Transactions on Embedded Computing Systems (TECS)
A Decoupled Predictor-Directed Stream Prefetching Architecture

IEEE Transactions on Computers
Stride-directed Prefetching for Secondary Caches

ICPP '97 Proceedings of the international Conference on Parallel Processing
Using the Compiler to Improve Cache Replacement Decisions

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Value-Profile Guided Stride Prefetching for Irregular Code

CC '02 Proceedings of the 11th International Conference on Compiler Construction
Content-Based Prefetching: Initial Results

IMS '00 Revised Papers from the Second International Workshop on Intelligent Memory Systems
Improving Performance for Software MPEG Players

COMPCON '96 Proceedings of the 41st IEEE International Computer Conference
How Useful Are Non-Blocking Loads, Stream Buffers and Speculative Execution in Multiple Issue Processors?

HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Using memory-mapped network interfaces to improve the performance of distributed shared memory

HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
Guided region prefetching: a cooperative hardware/software approach

Proceedings of the 30th annual international symposium on Computer architecture
Correlation Prefetching with a User-Level Memory Thread

IEEE Transactions on Parallel and Distributed Systems
A general framework for prefetch scheduling in linked data structures and its application to multi-chain prefetching

ACM Transactions on Computer Systems (TOCS)
Design and Optimization of Large Size and Low Overhead Off-Chip Caches

IEEE Transactions on Computers
Effective stream-based and execution-based data prefetching

Proceedings of the 18th annual international conference on Supercomputing
Cluster miss prediction with prefetch on miss for embedded CPU instruction caches

Proceedings of the 2004 international conference on Compilers, architecture, and synthesis for embedded systems
Cache Refill/Access Decoupling for Vector Machines

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Addressing mode driven low power data caches for embedded processors

WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
Memory predecryption: hiding the latency overhead of memory encryption

ACM SIGARCH Computer Architecture News - Special issue: Workshop on architectural support for security and anti-virus (WASSA)
On the performance of trace locality of reference

Performance Evaluation - Performance modelling and evaluation of high-performance parallel and distributed systems
Future Execution: A Hardware Prefetching Technique for Chip Multiprocessors

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Improving data cache performance with integrated use of split caches, victim cache and stream buffers

MEDEA '04 Proceedings of the 2004 workshop on MEmory performance: DEaling with Applications, systems and architecture
On the importance of optimizing the configuration of stream prefetchers

Proceedings of the 2005 workshop on Memory system performance
Spectral prefetcher: An effective mechanism for L2 cache prefetching

ACM Transactions on Architecture and Code Optimization (TACO)
Memory access pattern analysis and stream cache design for multimedia applications

ASP-DAC '03 Proceedings of the 2003 Asia and South Pacific Design Automation Conference
A Self-Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework

Proceedings of the International Symposium on Code Generation and Optimization
A Case for MLP-Aware Cache Replacement

Proceedings of the 33rd annual international symposium on Computer Architecture
CAVA: Using checkpoint-assisted value prediction to hide L2 misses

ACM Transactions on Architecture and Code Optimization (TACO)
Pattern-driven prefetching for multimedia applications on embedded processors

Journal of Systems Architecture: the EUROMICRO Journal
Memory bandwidth optimization through stream descriptors

MEDEA '05 Proceedings of the 2005 workshop on MEmory performance: DEaling with Applications, systems and architecture
Efficient emulation of hardware prefetchers via event-driven helper threading

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Future execution: A prefetching mechanism that uses multiple cores to speed up single threads

ACM Transactions on Architecture and Code Optimization (TACO)
Memory Prefetching Using Adaptive Stream Detection

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Impulse: Memory system support for scientific applications

Scientific Programming
Data prefetching and address pre-calculation through instruction pre-execution with two-step physical register deallocation

MEDEA '07 Proceedings of the 2007 workshop on MEmory performance: DEaling with Applications, systems and architecture
Prefetching irregular references for software cache on cell

Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
HMTT: a platform independent full-system memory trace monitoring system

SIGMETRICS '08 Proceedings of the 2008 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
PFetch: software prefetching exploiting temporal predictability of memory access streams

Proceedings of the 9th workshop on MEmory performance: DEaling with Applications, systems and architecture
Access map pattern matching for data cache prefetch

Proceedings of the 23rd international conference on Supercomputing
Stream chaining: exploiting multiple levels of correlation in data prefetching

Proceedings of the 36th annual international symposium on Computer architecture
Blue Gene/L compute chip: memory and Ethernet subsystem

IBM Journal of Research and Development
Timing local streams: improving timeliness in data prefetching

Proceedings of the 24th ACM International Conference on Supercomputing
SAMS multi-layout memory: providing multiple views of data to boost SIMD performance

Proceedings of the 24th ACM International Conference on Supercomputing
Understanding the behavior and implications of context switch misses

ACM Transactions on Architecture and Code Optimization (TACO)
Improving cache locality for thread-level speculation

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Many-Thread Aware Prefetching Mechanisms for GPGPU Applications

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Streaming Data Movement for Real-Time Image Analysis

Journal of Signal Processing Systems
Virtual memory window for application-specific reconfigurable coprocessors

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Resolving a L2-prefetch-caused parallel nonscaling on Intel Core microarchitecture

Journal of Parallel and Distributed Computing
Bandwidth constrained coordinated HW/SW prefetching for multicores

Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
Global-aware and multi-order context-based prefetching for high-performance processors

International Journal of High Performance Computing Applications
ABS: A low-cost adaptive controller for prefetching in a banked shared last-level cache

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
The gradient-based cache partitioning algorithm

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
A high performance heterogeneous architecture and its optimization design

HPCC'06 Proceedings of the Second international conference on High Performance Computing and Communications
Making data prefetch smarter: adaptive prefetching on POWER7

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Application data prefetching on the IBM blue gene/Q supercomputer

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Diagnosis and optimization of application prefetching performance

Proceedings of the 27th international ACM conference on International conference on supercomputing
S/DC: a storage and energy efficient data prefetcher

DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe
Linearizing irregular memory accesses for improved correlated prefetching

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture

Abstract

Today's commodity microprocessors require a low-latency memory system to achieve high sustained performance. The conventional high-performance memory system provides fast data access via a large secondary cache. But large secondary caches can be expensive, particularly in large-scale parallel systems with many processors (and thus many caches).

We evaluate a memory system design that can be cost-effective and can also provide better performance, particularly for scientific workloads: a single level of (on-chip) cache backed up only by Jouppi's stream buffers [10] and a main memory. This memory system requires very little hardware compared to a large secondary cache and doesn't require modifications to commodity processors. We use trace-driven simulation of fifteen scientific applications from the NAS and PERFECT suites in our evaluation. We present two techniques to enhance the effectiveness of Jouppi's original stream buffers: filtering schemes to reduce their memory bandwidth requirement, and a scheme that enables stream buffers to prefetch data being accessed in large strides. Our results show that, for the majority of our benchmarks, stream buffers can attain hit rates that are comparable to typical hit rates of secondary caches. Also, we find that as the data-set size of the scientific workload increases, the performance of streams typically improves relative to secondary cache performance, showing that streams are more scalable to large data-set sizes.
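
The abstract does not give implementation details, so the following is only an illustrative sketch of the general idea, not the authors' simulator. It models unit-stride stream buffers that sit behind the primary cache and see only its miss stream, plus one possible allocation filter: a buffer is allocated only after misses to two consecutive blocks. The names `StreamBuffer` and `simulate`, the parameters `n_streams`, `depth`, and `history`, the LRU replacement of buffers, and the specific filter rule are all assumptions made for the example. Large-stride detection is not modeled.

```python
from collections import deque


class StreamBuffer:
    """One FIFO of prefetched block numbers (unit stride, hypothetical model)."""

    def __init__(self, depth):
        self.depth = depth
        self.fifo = deque()
        self.next_block = None

    def restart(self, miss_block):
        # Flush the buffer and prefetch the blocks that follow the miss.
        self.fifo = deque(range(miss_block + 1, miss_block + 1 + self.depth))
        self.next_block = miss_block + 1 + self.depth

    def consume(self, block):
        # Only the head entry is compared, as in a simple FIFO design.
        if self.fifo and self.fifo[0] == block:
            self.fifo.popleft()
            self.fifo.append(self.next_block)  # one new prefetch per hit
            self.next_block += 1
            return True
        return False


def simulate(miss_blocks, n_streams=4, depth=2, use_filter=True, history=8):
    """Return (stream-buffer hits, prefetches issued) for a primary-cache miss stream.

    miss_blocks holds block numbers, not byte addresses.
    """
    streams = [StreamBuffer(depth) for _ in range(n_streams)]
    lru = deque(range(n_streams))   # least recently used buffer on the left
    recent = deque(maxlen=history)  # filter state: recent unallocated misses
    hits = prefetches = 0
    for b in miss_blocks:
        for i, s in enumerate(streams):
            if s.consume(b):
                hits += 1
                prefetches += 1
                lru.remove(i)
                lru.append(i)
                break
        else:
            # Assumed filter: allocate only if block b-1 also missed recently.
            if use_filter and (b - 1) not in recent:
                recent.append(b)
                continue
            victim = lru.popleft()
            lru.append(victim)
            streams[victim].restart(b)
            prefetches += depth
    return hits, prefetches


if __name__ == "__main__":
    sequential = range(10)             # a unit-stride miss stream
    isolated = [100, 500, 900, 1300]   # isolated misses, no stream to follow
    print(simulate(sequential))                  # (8, 10)
    print(simulate(isolated))                    # (0, 0)
    print(simulate(isolated, use_filter=False))  # (0, 8)
```

In this toy model the filter costs one extra miss on the sequential stream but issues no prefetches at all for the isolated misses, which is the kind of bandwidth saving the filtering schemes aim for.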