A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Iterative modulo scheduling: an algorithm for software pipelining loops
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Optimum modulo schedules for minimum register requirements
ICS '95 Proceedings of the 9th international conference on Supercomputing
Unrolling-based optimizations for modulo scheduling
Proceedings of the 28th annual international symposium on Microarchitecture
Reuse-Driven Tiling for Data Locality
LCPC '97 Proceedings of the 10th International Workshop on Languages and Compilers for Parallel Computing
Scratchpad memory: design alternative for cache on-chip memory in embedded systems
Proceedings of the tenth international symposium on Hardware/software codesign
Media Processing Applications on the Imagine Stream Processor
ICCD '02 Proceedings of the 2002 IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD'02)
Swing Modulo Scheduling: A Lifetime-Sensitive Approach
PACT '96 Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques
Brook for GPUs: stream computing on graphics hardware
ACM SIGGRAPH 2004 Papers
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Merrimac: Supercomputing with Streams
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Stream Programming on General-Purpose Processors
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
The potential of the cell processor for scientific computing
Proceedings of the 3rd conference on Computing frontiers
Compiling for stream processing
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
A 64-bit stream processor architecture for scientific applications
Proceedings of the 34th annual international symposium on Computer architecture
Comparing memory systems for chip multiprocessors
Proceedings of the 34th annual international symposium on Computer architecture
Streamware: programming general-purpose multicore processors using streams
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Orchestrating the execution of stream programs on multicore platforms
Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
Optimizing scientific application loops on stream processors
Proceedings of the 2008 ACM SIGPLAN-SIGBED conference on Languages, compilers, and tools for embedded systems
Transform coding on programmable stream processors
The Journal of Supercomputing
Entering the petaflop era: the architecture and performance of Roadrunner
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Matrix-based streamization approach for improving locality and parallelism on FT64 stream processor
The Journal of Supercomputing
Compiler assisted architectural exploration framework for coarse grained reconfigurable arrays
The Journal of Supercomputing
Modulo scheduling without overlapped lifetimes
Proceedings of the 2009 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Exploiting the reuse supplied by loop-dependent stream references for stream processors
ACM Transactions on Architecture and Code Optimization (TACO)
Reuse-aware modulo scheduling for stream processors
Proceedings of the Conference on Design, Automation and Test in Europe
Hi-index | 0.00 |
Both reuse and concurrency are performance-critical for stream processors. When applying loop unrolling and software pipelining separately to stream-level loops, either reuse or concurrency or both may be inadequately exploited. In this paper, we optimize modulo scheduling to maximize stream reuse and improve concurrency for stream-level loops. The key insight is that an unrolled and software-pipelined stream-level loop could be described by a set of reuse equations. Guided by reuse equations, a reuse-aware modulo scheduling algorithm is developed to simultaneously optimize the two performance objectives, reuse, and concurrency, for a loop in a unified framework. Moreover, we describe a code generation algorithm to automatically produce the optimized loop from a given loop. The experimental results obtained on FT64 and by simulation demonstrate the effectiveness of the proposed approach.