Dynamic Access Ordering for Streamed Computations

Authors:
Sally A. McKee;William A. Wulf;James H. Aylor;Maximo H. Salinas;Robert H. Klenke;Sung I. Hong;Dee A. B. Weikle
Affiliations:
Univ. of Utah, Salt Lake City;Univ. of Virginia, Charlottesville;Univ. of Virginia, Charlottesville;Univ. of Virginia, Charlottesville;Virginia Commonwealth Univ., Richmond;Lockheed Federal Systems, Manassas, VA;Univ. of Virginia, Charlottesville
Venue:
IEEE Transactions on Computers
Year:
2000

Abstract

Memory bandwidth is rapidly becoming the limiting performance factor for many applications, particularly for streaming computations such as scientific vector processing or multimedia (de)compression. Although these computations lack the temporal locality of reference that makes traditional caching schemes effective, they have predictable access patterns. Since most modern DRAM components support modes that make it possible to perform some access sequences faster than others, the predictability of the stream accesses makes it possible to reorder them to get better memory performance. We describe a Stream Memory Controller (SMC) system that combines compile-time detection of streams with execution-time selection of the access order and issue. The SMC effectively prefetches read-streams, buffers write-streams, and reorders the accesses to exploit the existing memory bandwidth as much as possible. Unlike most other hardware prefetching or stream buffer designs, this system does not increase bandwidth requirements. The SMC is practical to implement, using existing compiler technology and requiring only a modest amount of special-purpose hardware. We present simulation results for fast-page mode and Rambus DRAM memory systems, and we describe a prototype system with which we have observed performance improvements for inner loops by factors of 13 over traditional access methods.
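
To make the reordering idea concrete, here is a minimal sketch of a toy cost model, not the SMC's actual scheduling policy. It assumes a single-bank page-mode DRAM in which an access to the currently open page costs less than an access that must open a new page. The function names, the page size, and the cycle counts (`hit_cycles`, `miss_cycles`) are hypothetical values chosen for illustration only. The sketch compares the order in which a loop would naturally touch three streams with an order that issues several consecutive elements of one stream before switching, as a controller with per-stream buffers could.

```python
def access_cost(addresses, page_size=1024, hit_cycles=1, miss_cycles=4):
    """Total cycles for a sequence of word addresses on a toy single-bank
    page-mode DRAM. Timings are illustrative assumptions, not measured."""
    open_page = None
    cycles = 0
    for addr in addresses:
        page = addr // page_size
        if page == open_page:
            cycles += hit_cycles      # same page: fast access
        else:
            cycles += miss_cycles     # new page: pay the page-open cost
            open_page = page
    return cycles


def natural_order(bases, n):
    """Element i of every stream in turn, as a simple loop would issue them."""
    return [base + i for i in range(n) for base in bases]


def batched_order(bases, n, depth):
    """Issue up to `depth` consecutive elements of one stream before
    switching to the next stream (depth models a per-stream buffer)."""
    order = []
    for start in range(0, n, depth):
        for base in bases:
            order.extend(base + i for i in range(start, min(start + depth, n)))
    return order


# Three unit-stride streams of 64 words, each starting in a different page.
bases = [0, 4096, 8192]
natural = access_cost(natural_order(bases, 64))
batched = access_cost(batched_order(bases, 64, depth=16))
print(natural, batched)
```

In this toy model the natural order switches pages on every access, while the batched order pays the page-open cost only once per batch. Both orders perform the same set of accesses, which reflects the abstract's point that reordering does not increase bandwidth requirements. Real devices add banks, precharge, and refresh effects that this sketch ignores.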