Complexity/performance tradeoffs with non-blocking loads

Authors:
K. I. Farkas; N. P. Jouppi
Affiliations:
Dept. of Electrical and Computer Engineering, University of Toronto, 10 Kings College Rd., Toronto, Ontario, Canada, M5S 1A4; Digital Equipment Corporation Western Research Lab, 250 University Avenue, Palo Alto, CA
Venue:
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Year:
1994

Abstract

Non-blocking loads are a very effective technique for tolerating the cache-miss latency on data cache references. In this paper, we describe several methods for implementing non-blocking loads. A range of resulting hardware complexity/performance tradeoffs are investigated using an object-code translation and instrumentation system. We have investigated the SPEC92 benchmarks and have found that for the integer benchmarks, a simple hit-under-miss implementation achieves almost all of the available performance improvement for relatively little cost. However, for most of the numeric benchmarks, more expensive implementations are worthwhile. The results also point out the importance of using a compiler capable of scheduling load instructions for cache misses rather than cache hits in non-blocking systems.
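To make the tradeoff in the abstract concrete, the following is a minimal sketch of a toy timing model, not the paper's object-code translation and instrumentation system. Everything in it is an assumption made for illustration: the function name `run`, the in-order single-issue processor, the 10-cycle miss penalty, the 32-byte lines, the infinite cache with cold misses only, and the two hand-made traces. Setting `max_outstanding` to 0 models a blocking cache, 1 models a simple hit-under-miss design, and larger values model more expensive designs that allow several misses in flight.

```python
def run(trace, max_outstanding, miss_penalty=10, line_size=32):
    """Cycles taken by a toy in-order, single-issue processor (illustrative only)."""
    cached = set()     # resident line numbers (assumed infinite cache, cold misses only)
    inflight = {}      # line number -> cycle at which its fill completes
    reg_ready = {}     # register name -> cycle at which its value is usable
    cycle = 0

    def retire(now):
        # Move every fill that has completed by `now` into the cache.
        for line in [l for l, done in inflight.items() if done <= now]:
            cached.add(line)
            del inflight[line]

    for op in trace:
        kind = op[0]
        if kind == "nop":                      # independent work, 1 cycle
            cycle += 1
        elif kind == "use":                    # stalls until the register is ready
            cycle = max(cycle, reg_ready.get(op[1], 0)) + 1
        elif kind == "load":
            _, addr, dest = op
            line = addr // line_size
            retire(cycle)
            if line in cached:                 # hit: data usable next cycle
                reg_ready[dest] = cycle + 1
                cycle += 1
            elif line in inflight:             # assumed: merge with the pending fill
                reg_ready[dest] = inflight[line]
                cycle += 1
            elif max_outstanding == 0:         # blocking cache: stall for the whole miss
                cycle += miss_penalty
                cached.add(line)
                reg_ready[dest] = cycle
            else:                              # non-blocking: keep issuing under the miss
                if len(inflight) >= max_outstanding:
                    # No free miss slot: wait for the earliest fill to finish.
                    cycle = max(cycle, min(inflight.values()))
                    retire(cycle)
                inflight[line] = cycle + miss_penalty
                reg_ready[dest] = cycle + miss_penalty
                cycle += 1
        else:
            raise ValueError(f"unknown op: {op!r}")
    return cycle


# Hypothetical trace 1: isolated misses, each followed by independent work.
isolated = ([("load", 0, "r1")] + [("nop",)] * 9 + [("use", "r1")]
            + [("load", 64, "r2")] + [("nop",)] * 9 + [("use", "r2")])

# Hypothetical trace 2: four misses to distinct lines issued back to back.
clustered = ([("load", a, f"r{i}") for i, a in enumerate((0, 64, 128, 192))]
             + [("use", f"r{i}") for i in range(4)])

if __name__ == "__main__":
    for name, trace in (("isolated", isolated), ("clustered", clustered)):
        print(name, [run(trace, n) for n in (0, 1, 4)])
```

Under these assumptions, the isolated-miss trace takes 40 cycles with a blocking cache and 22 cycles with either one or four outstanding misses, so a single miss slot captures all of the gain. The clustered trace takes 44, 41, and 14 cycles respectively, so it benefits only when several misses can overlap. This mirrors the qualitative pattern the abstract reports for integer versus numeric benchmarks, but the traces are invented and say nothing about the SPEC92 measurements themselves.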