Memory-side prefetching for linked data structures for processor-in-memory systems

Authors:
Christopher J. Hughes;Sarita V. Adve
Affiliations:
Architecture Research Lab, Intel Corporation, 2200 Mission College Blvd., SC12-303, Santa Clara, CA 94054, USA;Department of Computer Science, University of Illinois at Urbana-Champaign, 201 N. Goodwin Ave., Urbana, IL 61801, USA
Venue:
Journal of Parallel and Distributed Computing
Year:
2005

Citing 28
Cited 7

Software caching and computation migration in Olden

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
The SPLASH-2 programs: characterization and methodological considerations

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Speeding up irregular applications in shared-memory multiprocessors: memory binding and group prefetching

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
SPAID: software prefetching in pointer- and call-intensive environments

Proceedings of the 28th annual international symposium on Microarchitecture
An effective programmable prefetch engine for on-chip caches

Proceedings of the 28th annual international symposium on Microarchitecture
Missing the memory wall: the case for processor/memory integration

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Compiler-based prefetching for recursive data structures

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Examination of a memory access classification scheme for pointer-intensive and numeric programs

ICS '96 Proceedings of the 10th international conference on Supercomputing
Tango: a hardware-based data prefetching technique for superscalar processors

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Prefetching using Markov predictors

Proceedings of the 24th annual international symposium on Computer architecture
DataScalar architectures

Proceedings of the 24th annual international symposium on Computer architecture
Active pages: a computation model for intelligent memory

Proceedings of the 25th annual international symposium on Computer architecture
Dependence based prefetching for linked data structures

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Cache-conscious data placement

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
A look at several memory management units, TLB-refill mechanisms, and page table organizations

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
The Impact of Exploiting Instruction-Level Parallelism on Shared-Memory Multiprocessors

IEEE Transactions on Computers - Special issue on cache memory and related problems
Correlated load-address predictors

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Memory forwarding: enabling aggressive layout optimizations by guaranteeing the safety of data relocation

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Effective jump-pointer prefetching for linked data structures

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Cache-conscious structure layout

Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Mapping irregular applications to DIVA, a PIM-based data-intensive architecture

SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Push vs. pull: data movement for linked data structures

Proceedings of the 14th international conference on Supercomputing
Processing in Memory: The Terasys Massively Parallel PIM Array

Computer
Scalable Processors in the Billion-Transistor Era: IRAM

Computer
Effective Hardware-Based Data Prefetching for High-Performance Processors

IEEE Transactions on Computers
Distributed Prefetch-buffer/Cache Design for High Performance Memory Systems

HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
Impulse: Building a Smarter Memory Controller

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
FlexRAM: Toward an Advanced Intelligent Memory System

ICCD '99 Proceedings of the 1999 IEEE International Conference on Computer Design

Overlapping dependent loads with addressless preload

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Stealth prefetching

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Real-time rendering systems in 2010

SIGGRAPH '05 ACM SIGGRAPH 2005 Courses
An immune inspired co-evolutionary affinity network for prefetching of distributed object

Journal of Parallel and Distributed Computing
Engineering scalable, cache and space efficient tries for strings

The VLDB Journal — The International Journal on Very Large Data Bases
Redesigning the string hash table, burst trie, and BST to exploit cache

Journal of Experimental Algorithmics (JEA)
Meeting midway: improving CMP performance with memory-side prefetching

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper studies a memory-side prefetching technique to hide latency incurred by inherently serial accesses to linked data structures (LDS). A programmable engine sits close to memory and traverses LDS independently from the processor. The engine can run ahead of the processor because of its low latency path to memory, allowing it to initiate data transfers earlier than the processor and pipeline multiple transfers over the network. We evaluate the proposed memory-side prefetching scheme for the Olden benchmarks on a processor-in-memory system. For the six benchmarks where LDS memory stall time is significant, the memory-side scheme reduces execution time by an average of 27% compared to a system without any prefetching. Compared to a state-of-the-art processor-side software prefetching scheme, the memory-side scheme reduces execution time in the range of 20-50% for three of the six applications, is about the same for two applications, and is worse by 18% for one application. We conclude that our memory-side scheme is effective, but a combination of the processor- and memory-side prefetching schemes is best and provide a qualitative framework to determine when either scheme should be used.