Using a user-level memory thread for correlation prefetching

Authors: Yan Solihin; Jaejin Lee; Josep Torrellas
Affiliations: University of Illinois, Urbana-Champaign; Michigan State University; University of Illinois, Urbana-Champaign
Venue: ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Year: 2002

Abstract

This paper introduces the idea of using a User-Level Memory Thread (ULMT) for correlation prefetching. In this approach, a user thread runs on a general-purpose processor in main memory, either in the memory controller chip or in a DRAM chip. The thread performs correlation prefetching in software, sending the prefetched data into the L2 cache of the main processor. This approach requires minimal hardware beyond the memory processor: the correlation table is a software data structure that resides in main memory, while the main processor needs only a few modifications to its L2 cache so that it can accept incoming prefetches. In addition, the approach is widely applicable, as it can prefetch effectively even for irregular applications. Finally, it is very flexible, as the user can customize the prefetching algorithm on a per-application basis. Our simulation results show that, with a new design of the correlation table and prefetching algorithm, the scheme achieves an average speedup of 1.32 across nine mostly irregular applications. Furthermore, the scheme works well in combination with a conventional processor-side sequential prefetcher, in which case the average speedup increases to 1.46. Finally, by exploiting the customization of the prefetching algorithm, we increase the average speedup to 1.53.
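
To make the notion of a software correlation table concrete, the following is a minimal Python sketch of basic pair-based correlation prefetching: each observed miss address is mapped to the misses that most recently followed it, and those successors become prefetch candidates the next time the address misses. It is an illustration under stated assumptions, not the new table design or algorithm the abstract refers to. The class name `CorrelationTable`, the method `observe_miss`, and the parameters `num_rows` and `num_succ` are all hypothetical.

```python
from collections import OrderedDict


class CorrelationTable:
    """Hypothetical pair-based correlation table kept as a software structure.

    Maps a miss address to the addresses that most recently followed it
    in the miss stream, most recent first.
    """

    def __init__(self, num_rows=1024, num_succ=2):
        self.num_rows = num_rows    # assumed capacity, in rows
        self.num_succ = num_succ    # assumed successors kept per row
        self.rows = OrderedDict()   # miss address -> successor list
        self.last_miss = None

    def _learn(self, prev, addr):
        # Record that `addr` followed `prev` in the miss stream.
        succ = self.rows.setdefault(prev, [])
        self.rows.move_to_end(prev)          # mark row as recently updated
        if addr in succ:
            succ.remove(addr)
        succ.insert(0, addr)                 # most recent successor first
        del succ[self.num_succ:]             # keep at most num_succ entries
        if len(self.rows) > self.num_rows:
            self.rows.popitem(last=False)    # evict least recently updated row

    def observe_miss(self, addr):
        """Update the table with one miss; return addresses to prefetch."""
        if self.last_miss is not None:
            self._learn(self.last_miss, addr)
        self.last_miss = addr
        return list(self.rows.get(addr, []))
```

In this sketch, feeding the repeating miss stream `0x100, 0x200, 0x300` through `observe_miss` yields no prefetches on the first pass, since the table is still learning. On the second pass each miss returns the address that followed it the first time. In a ULMT-style setting, a loop of this kind would run on the memory processor and push the returned lines toward the main processor's L2 cache; that plumbing is outside the scope of the sketch.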