The Memory Bandwidth Bottleneck and its Amelioration by a Compiler

Venue: IPDPS '00 Proceedings of the 14th International Symposium on Parallel and Distributed Processing
Year: 2000

Abstract

As the speed gap between CPU and memory widens, the memory hierarchy has become the primary factor limiting program performance. Until now, the principal focus of hardware and software innovations has been overcoming latency. However, the advent of latency tolerance techniques such as non-blocking caches and software prefetching begins the process of trading bandwidth for latency by overlapping and pipelining memory transfers. Since actual latency is the inverse of the consumed bandwidth, memory latency cannot be fully tolerated without infinite bandwidth. This perspective has led us to two questions. Do current machines provide sufficient data bandwidth? If not, can a program be restructured to consume less bandwidth?

This paper answers these questions in two parts. The first part defines a new bandwidth-based performance model and demonstrates the serious performance bottleneck caused by the lack of memory bandwidth. The second part describes a new set of compiler optimizations for reducing the bandwidth consumption of programs. The optimizations are bandwidth-minimal loop fusion, array shrinking and peeling, and store elimination.
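To make the effect of such restructuring concrete, the following is a minimal sketch of what loop fusion combined with shrinking a temporary array aims for. It is not the paper's algorithm. The function names `unfused` and `fused`, the arithmetic, and the `traffic` counter are all hypothetical, and the counter is a deliberately crude model that charges one unit per array-element read or write and ignores caches.

```python
def unfused(a):
    """Two separate loops; the temporary array tmp round-trips through memory."""
    n = len(a)
    tmp = [0.0] * n
    traffic = 0                      # counted array-element reads and writes
    for i in range(n):               # loop 1: produce tmp
        tmp[i] = a[i] * 2.0
        traffic += 2                 # read a[i], write tmp[i]
    total = 0.0
    for i in range(n):               # loop 2: consume tmp
        total += tmp[i] + 1.0
        traffic += 1                 # read tmp[i]
    return total, traffic


def fused(a):
    """One fused loop; tmp shrinks to a scalar, so its array stores disappear."""
    total = 0.0
    traffic = 0
    for x in a:
        t = x * 2.0                  # scalar replaces tmp[i]
        total += t + 1.0
        traffic += 1                 # only a[i] is read
    return total, traffic


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    print(unfused(data))
    print(fused(data))
```

Under this simplified model both versions compute the same result, but the fused version moves one array element per iteration instead of three, which is the kind of bandwidth reduction the optimizations listed above target.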