Access ordering and memory-conscious cache utilization

Authors:
S. A. Mckee;W. A. Wulf
Affiliations:
-;-
Venue:
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Year:
1995

Citing 22
Cited 20

Vector access performance in parallel memories using skewed storage scheme

IEEE Transactions on Computers
More iteration space tiling

Proceedings of the 1989 ACM/IEEE conference on Supercomputing
Software prefetching

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
High-bandwidth data memory systems for superscalar processors

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Code generation for streaming: an access/execute mechanism

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
An architecture for software-controlled data prefetching

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Pseudo-randomly interleaved memory

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Comparative evaluation of latency reducing and tolerating techniques

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
An effective on-chip preloading scheme to reduce data access penalty

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Increasing the number of strides for conflict-free vector access

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
A comparison of three current superscalar designs

ACM SIGARCH Computer Architecture News
Design and evaluation of a compiler algorithm for prefetching

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
The Chinese remainder theorem and the prime memory system

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Increasing memory bandwidth for vector computations

Proceedings of the international conference on Programming languages and system architectures
Access ordering and effective memory bandwidth

Access ordering and effective memory bandwidth
Blocking Linear Algebra Codes for Memory Hierarchies

Proceedings of the Fourth SIAM Conference on Parallel Processing for Scientific Computing
Performance of the IPSC/860 Node Architecture

Performance of the IPSC/860 Node Architecture
Hardware Support for Dynamic Access Ordering: Performance of Some Design Options

Hardware Support for Dynamic Access Ordering: Performance of Some Design Options
Software methods for improvement of cache performance on supercomputer applications

Software methods for improvement of cache performance on supercomputer applications
The Organization and Use of Parallel Memories

IEEE Transactions on Computers

Design and evaluation of dynamic access ordering hardware

ICS '96 Proceedings of the 10th international conference on Supercomputing
Increasing TLB reach using superpages backed by shadow memory

Proceedings of the 25th annual international symposium on Computer architecture
A performance comparison of contemporary DRAM architectures

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Hardware-only stream prefetching and dynamic access ordering

Proceedings of the 14th international conference on Supercomputing
Memory access scheduling

Proceedings of the 27th annual international symposium on Computer architecture
Dynamic Access Ordering for Streamed Computations

IEEE Transactions on Computers
Concurrency, latency, or system overhead: which has the largest impact on uniprocessor DRAM-system performance?

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
The Impulse Memory Controller

IEEE Transactions on Computers
High-Performance DRAMs in Workstation Environments

IEEE Transactions on Computers
Designing a Modern Memory Hierarchy with Hardware Prefetching

IEEE Transactions on Computers
Memory System Support for Irregular Applications

LCR '98 Selected Papers from the 4th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers
DRAM-Page Based Prediction and Prefetching

ICCD '00 Proceedings of the 2000 IEEE International Conference on Computer Design: VLSI in Computers & Processors
A Case for Studying DRAM Issues at the System Level

IEEE Micro
Custom Data Layout for Memory Parallelism

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Locality-conscious workload assignment for array-based computations in MPSOC architectures

Proceedings of the 42nd annual Design Automation Conference
Impulse: Memory system support for scientific applications

Scientific Programming
A memory-conscious code parallelization scheme

Proceedings of the 44th annual Design Automation Conference
Designing packet buffers for router linecards

IEEE/ACM Transactions on Networking (TON)
Decoupled DIMM: building high-bandwidth memory system using low-speed DRAM devices

Proceedings of the 36th annual international symposium on Computer architecture
Physically addressed queueing (PAQ): improving parallelism in solid state disks

Proceedings of the 39th Annual International Symposium on Computer Architecture

Quantified Score

Hi-index	0.01

Visualization

Abstract

As processor speeds increase relative to memory speeds, memory bandwidth is rapidly becoming the limiting performance, factor for many applications. Several approaches to bridging this performance gap have been suggested. This paper examines one approach, access ordering, and pushes its limits to determine bounds on memory performance. We present several access-ordering schemes, and compare their performance, developing analytic models and partially validating these with benchmark timings on the Intel i860XR.