Software-controlled caches in the VMP multiprocessor
ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
Vector access performance in parallel memories using skewed storage scheme
IEEE Transactions on Computers
A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Code generation for streaming: an access/execute mechanism
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
An architecture for software-controlled data prefetching
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Data prefetching in multiprocessor vector cache memories
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Pseudo-randomly interleaved memory
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
An effective on-chip preloading scheme to reduce data access penalty
Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Using Lookahead to reduce memory bank contention for decoupled operand references
Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Increasing the number of strides for conflict-free vector access
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Prefetch unit for vector operations on scalar computers
ACM SIGARCH Computer Architecture News
Tolerating data access latency with register preloading
ICS '92 Proceedings of the 6th international conference on Supercomputing
Design and evaluation of a compiler algorithm for prefetching
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
An efficient architecture for loop based data preloading
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
The Chinese remainder theorem and the prime memory system
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Memory access coalescing: a technique for eliminating redundant memory accesses
PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
Evaluating stream buffers as a secondary cache replacement
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Access ordering and effective memory bandwidth
Access ordering and effective memory bandwidth
Hitting the memory wall: implications of the obvious
ACM SIGARCH Computer Architecture News
Synchronization and communication in the T3E multiprocessor
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Maximizing memory bandwidth for streamed computations
Maximizing memory bandwidth for streamed computations
Memory-system design considerations for dynamically-scheduled processors
Proceedings of the 24th annual international symposium on Computer architecture
Increasing TLB reach using superpages backed by shadow memory
Proceedings of the 25th annual international symposium on Computer architecture
Design challenges of virtual networks: fast, general-purpose communication
Proceedings of the seventh ACM SIGPLAN symposium on Principles and practice of parallel programming
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Sunder: a programmable hardware prefetch architecture for numerical loops
Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Sequential Hardware Prefetching in Shared-Memory Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
Access order to avoid inter-vector-conflicts in complex memory systems
IPPS '95 Proceedings of the 9th International Symposium on Parallel Processing
Access ordering and memory-conscious cache utilization
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Access Order and Effective Bandwidth for Streams on a Direct Rambus Memory
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Impulse: Building a Smarter Memory Controller
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Command Vector Memory Systems: High Performance at Low Cost
PACT '98 Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques
Memory System Support for Image Processing
PACT '99 Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques
Proceedings of the 2002 international symposium on Low power electronics and design
WOSP '02 Proceedings of the 3rd international workshop on Software and performance
IEEE Transactions on Computers
Array organization in parallel memories
International Journal of Parallel Programming
Adaptive History-Based Memory Schedulers
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Memory Controller Optimizations for Web Servers
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Identifying and Exploiting Spatial Regularity in Data Memory References
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Exploiting locality to ameliorate packet queue contention and serialization
Proceedings of the 3rd conference on Computing frontiers
The bit-reversal SDRAM address mapping
SCOPES '05 Proceedings of the 2005 workshop on Software and compilers for embedded systems
Memory bandwidth optimization through stream descriptors
MEDEA '05 Proceedings of the 2005 workshop on MEmory performance: DEaling with Applications , systems and architecture
Memory scheduling for modern microprocessors
ACM Transactions on Computer Systems (TOCS)
Exploiting program cyclic behavior to reduce memory latency in embedded processors
Proceedings of the 2008 ACM symposium on Applied computing
Optimizing thread throughput for multithreaded workloads on memory constrained CMPs
Proceedings of the 5th conference on Computing frontiers
Configurable data memory for multimedia processing
Journal of Signal Processing Systems - Special Issue: Embedded computing systems for DSP
Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Self-Optimizing Memory Controllers: A Reinforcement Learning Approach
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Prefetch-Aware DRAM Controllers
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Global management of cache hierarchies
Proceedings of the 7th ACM international conference on Computing frontiers
Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Streaming Data Movement for Real-Time Image Analysis
Journal of Signal Processing Systems
Memory-access-aware data structure transformations for embedded software with dynamic data accesses
IEEE Transactions on Very Large Scale Integration (VLSI) Systems - Special section on the 2002 international symposium on low-power electronics and design (ISLPED)
A bursty multi-port memory controller with quality-of-service guarantees
CODES+ISSS '11 Proceedings of the seventh IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Hierarchical memory scheduling for multimedia MPSoCs
Proceedings of the International Conference on Computer-Aided Design
Parallel application memory scheduling
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Staged memory scheduling: achieving high performance and scalability in heterogeneous systems
Proceedings of the 39th Annual International Symposium on Computer Architecture
Conservative row activation to improve memory power efficiency
Proceedings of the 27th international ACM conference on International conference on supercomputing
Return data interleaving for multi-channel embedded CMPs systems
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Reducing DRAM row activations with eager read/write clustering
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 14.98 |
Memory bandwidth is rapidly becoming the limiting performance factor for many applications, particularly for streaming computations such as scientific vector processing or multimedia (de)compression. Although these computations lack the temporal locality of reference that makes traditional caching schemes effective, they have predictable access patterns. Since most modern DRAM components support modes that make it possible to perform some access sequences faster than others, the predictability of the stream accesses makes it possible to reorder them to get better memory performance. We describe a Stream Memory Controller (SMC) system that combines compile-time detection of streams with execution-time selection of the access order and issue. The SMC effectively prefetches read-streams, buffers write-streams, and reorders the accesses to exploit the existing memory bandwidth as much as possible. Unlike most other hardware prefetching or stream buffer designs, this system does not increase bandwidth requirements. The SMC is practical to implement, using existing compiler technology and requiring only a modest amount of special-purpose hardware. We present simulation results for fast-page mode and Rambus DRAM memory systems and we describe a prototype system with which we have observed performance improvements for inner loops by factors of 13 over traditional access methods.