This paper presents a strategy that integrates compiler optimizations and analysis techniques to detect and transform time-step loops for efficient execution on multicore platforms. Time-step computations, which appear frequently in scientific applications, are amenable to pipelined parallelism and exhibit a high degree of temporal locality. However, striking the right balance between data locality and parallelism often proves difficult, particularly on current multicore architectures, where one or more levels of the memory hierarchy are shared among multiple processing units. Our proposed strategy addresses performance issues related to both data locality and parallelism. By carefully orchestrating a set of source-to-source transformations, our technique exposes fine-grain parallelism within a time-step loop while improving its cache utilization and reducing its bandwidth requirements. Preliminary experiments with two time-step applications on three multicore platforms show that the code variants generated by our strategy incur significantly fewer misses in the shared caches and achieve better execution times through reduced synchronization costs.
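To make the notion of a time-step loop and its temporal locality concrete, the following is a minimal sketch, not the paper's actual transformation sequence. It assumes a hypothetical 1D three-point averaging stencil with fixed boundaries and applies time skewing, one well-known locality transformation for such loops. The names `update`, `run_original`, `run_skewed`, and the `tile` parameter are illustrative only, and the full time-by-space grid is kept in memory purely to keep the legality argument simple.

```python
def update(prev, i):
    """Next value at point i; the two boundary points keep their value."""
    if i == 0 or i == len(prev) - 1:
        return prev[i]
    return (prev[i - 1] + prev[i] + prev[i + 1]) / 3.0


def run_original(init, steps):
    """Plain time-step loop: sweep the whole array once per time step."""
    n = len(init)
    grid = [list(init)] + [[0.0] * n for _ in range(steps)]
    for t in range(steps):
        for i in range(n):
            grid[t + 1][i] = update(grid[t], i)
    return grid[steps]


def run_skewed(init, steps, tile):
    """Time-skewed variant: tile the skewed coordinate j = i + t.

    Every dependence has a non-negative distance in j (0, 1, or 2) and
    distance 1 in t, so visiting tiles in increasing j order, and time
    steps in increasing order inside a tile, preserves all dependences.
    Each tile reuses values from several time steps while they are
    still recently touched.
    """
    n = len(init)
    grid = [list(init)] + [[0.0] * n for _ in range(steps)]
    for j0 in range(0, n + steps - 1, tile):
        for t in range(steps):
            lo = max(0, j0 - t)            # clamp the skewed tile
            hi = min(n, j0 + tile - t)     # to the array bounds
            for i in range(lo, hi):
                grid[t + 1][i] = update(grid[t], i)
    return grid[steps]


init = [float((i * i) % 7) for i in range(20)]
assert run_skewed(init, 5, 4) == run_original(init, 5)
```

Both variants perform the same arithmetic on the same operands for every grid point, so the results match exactly; only the order in which points are visited changes. A production version would store far fewer time levels and would need additional analysis to run tiles in parallel, which this sketch does not attempt.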