Hardware spatial forwarding for widely shared data

Authors:
Marius Pirvu; Laxmi Bhuyan
Affiliations:
Department of Computer Science, Texas A&M University, College Station, TX; Department of Computer Science, Texas A&M University, College Station, TX
Venue:
Proceedings of the 14th international conference on Supercomputing
Year:
2000

Abstract

Applications with widely shared data perform poorly on cc-NUMA multiprocessors because of the hot spots they create in the system. In this paper we address this problem by enhancing the memory controller with a forwarding mechanism that hides the read latency of widely shared data while potentially decreasing memory and network contention. Based on the stream of incoming requests, the memory anticipates the next read references and forwards the data to the processors in advance. To identify the set of processors to which the data should be forwarded, we use a heuristic based on the spatial locality of memory blocks. To increase forwarding effectiveness and minimize the number of messages, we incorporate simple filters combined with a feedback mechanism. We also show that further improvements are possible with a combined software-prefetching/hardware-forwarding approach. Our experimental results, obtained with a detailed execution-driven simulator that models ILP processors, show significant improvements in execution time (up to 37%).
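
The abstract does not spell out the heuristic, the filters, or the feedback rule, so the following is only a toy sketch of how such a mechanism might be organized. Everything in it is an assumption made for illustration: the class name `ForwardingDirectory`, the rule that processors which read block `b - 1` are predicted to read block `b`, the `min_sharers` filter, and the usefulness threshold used as feedback. It is not the authors' design.

```python
from collections import defaultdict


class ForwardingDirectory:
    """Toy model of a memory controller that forwards blocks speculatively.

    Hypothetical sketch: the neighbor rule, the sharer filter, and the
    feedback threshold are illustrative assumptions, not the paper's design.
    """

    def __init__(self, min_sharers=2, min_useful_ratio=0.5, warmup=4):
        self.readers = defaultdict(set)    # block -> processors that read it
        self.forwarded = defaultdict(set)  # block -> processors it was pushed to
        self.min_sharers = min_sharers     # filter: only widely shared data
        self.min_useful_ratio = min_useful_ratio
        self.warmup = warmup               # forwards issued before feedback applies
        self.sent = 0                      # feedback: forwards issued
        self.useful = 0                    # feedback: forwards later consumed

    def forwarding_enabled(self):
        # Feedback: stop forwarding if too few earlier forwards were consumed.
        if self.sent < self.warmup:
            return True
        return self.useful / self.sent >= self.min_useful_ratio

    def read(self, proc, block):
        """Handle a read; return the processors the block is forwarded to."""
        if proc in self.forwarded[block]:
            # The block was already pushed to this processor: a useful forward.
            self.forwarded[block].discard(proc)
            self.useful += 1
            self.readers[block].add(proc)
            return set()
        self.readers[block].add(proc)
        previous = self.readers[block - 1]
        if len(previous) < self.min_sharers or not self.forwarding_enabled():
            return set()
        # Spatial heuristic: readers of the previous block are predicted
        # to read this block next, so push it to them ahead of time.
        targets = previous - self.readers[block] - self.forwarded[block]
        self.forwarded[block] |= targets
        self.sent += len(targets)
        return targets


directory = ForwardingDirectory()
for p in (0, 1, 2):
    directory.read(p, 10)          # three processors read block 10
print(directory.read(0, 11))       # block 11 is forwarded to processors 1 and 2
```

In this sketch a later read of block 11 by processor 1 or 2 is counted as a consumed forward, which is what keeps the feedback check passing; a real controller would also have to handle writes, invalidations, and limited table capacity, none of which are modeled here.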