As the speed gap between CPU and memory widens, the memory hierarchy has become the primary factor limiting program performance. Until now, the principal focus of hardware and software innovations has been overcoming latency. However, the advent of latency tolerance techniques such as non-blocking caches and software prefetching begins the process of trading bandwidth for latency by overlapping and pipelining memory transfers. Since actual latency is the inverse of the consumed bandwidth, memory latency cannot be fully tolerated without infinite bandwidth. This perspective has led us to two questions. Do current machines provide sufficient data bandwidth? If not, can a program be restructured to consume less bandwidth?

This paper answers these questions in two parts. The first part defines a new bandwidth-based performance model and demonstrates the serious performance bottleneck caused by the lack of memory bandwidth. The second part describes a new set of compiler optimizations for reducing the bandwidth consumption of programs. The optimizations are bandwidth-minimal loop fusion, array shrinking and peeling, and store elimination.
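
As a rough illustration of why restructuring a program can reduce bandwidth consumption, the sketch below shows the general idea behind loop fusion followed by array shrinking on a toy computation. It is not the paper's algorithm: the function names (`unfused`, `fused_and_shrunk`, `array_traffic`) and the access-counting model are hypothetical, and the count ignores caches entirely, simply tallying array element loads and stores.

```python
def unfused(a):
    """Two separate loops: the temporary array is written in full, then read back."""
    n = len(a)
    tmp = [0.0] * n
    for i in range(n):        # loop 1: loads a[i], stores tmp[i]
        tmp[i] = a[i] * 2.0
    total = 0.0
    for i in range(n):        # loop 2: loads tmp[i] again
        total += tmp[i] + 1.0
    return total


def fused_and_shrunk(a):
    """Both loops fused into one; tmp is shrunk from an array to a scalar."""
    total = 0.0
    for x in a:               # single loop: loads a[i] only
        t = x * 2.0           # scalar t replaces tmp[i]
        total += t + 1.0
    return total


def array_traffic(n):
    """Assumed cost model: count array element accesses (loads + stores)."""
    unfused_accesses = 3 * n  # load a, store tmp, load tmp
    fused_accesses = n        # load a only
    return unfused_accesses, fused_accesses
```

Under this simplified model the fused version touches one third as many array elements while computing the same result. A real compiler would also have to prove that the fusion is legal and that the temporary is not needed after the loop.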