Optimizing modulo scheduling to achieve reuse and concurrency for stream processors

Authors:
Li Wang;Jingling Xue;Xuejun Yang
Affiliations:
School of Computer, National University of Defense Technology, Changsha, China 410073;School of Computer Science and Engineering, UNSW, Sydney, Australia 2052;School of Computer, National University of Defense Technology, Changsha, China 410073
Venue:
The Journal of Supercomputing
Year:
2012

Citing 27
Cited 0

A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Iterative modulo scheduling: an algorithm for software pipelining loops

MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Optimum modulo schedules for minimum register requirements

ICS '95 Proceedings of the 9th international conference on Supercomputing
Unrolling-based optimizations for modulo scheduling

Proceedings of the 28th annual international symposium on Microarchitecture
Reuse-Driven Tiling for Data Locality

LCPC '97 Proceedings of the 10th International Workshop on Languages and Compilers for Parallel Computing
Scratchpad memory: design alternative for cache on-chip memory in embedded systems

Proceedings of the tenth international symposium on Hardware/software codesign
Media Processing Applications on the Imagine Stream Processor

ICCD '02 Proceedings of the 2002 IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD'02)
Swing Modulo Scheduling: A Lifetime-Sensitive Approach

PACT '96 Proceedings of the 1996 Conference on Parallel Architectures and Compilation Techniques
Brook for GPUs: stream computing on graphics hardware

ACM SIGGRAPH 2004 Papers
The Stream Virtual Machine

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Merrimac: Supercomputing with Streams

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Stream Programming on General-Purpose Processors

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
The potential of the cell processor for scientific computing

Proceedings of the 3rd conference on Computing frontiers
Compiling for stream processing

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
A 64-bit stream processor architecture for scientific applications

Proceedings of the 34th annual international symposium on Computer architecture
Comparing memory systems for chip multiprocessors

Proceedings of the 34th annual international symposium on Computer architecture
Streamware: programming general-purpose multicore processors using streams

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
GRAPE-DR: 2-Pflops massively-parallel computer with 512-core, 512-Gflops processor chips for scientific computing

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Orchestrating the execution of stream programs on multicore platforms

Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
Optimizing scientific application loops on stream processors

Proceedings of the 2008 ACM SIGPLAN-SIGBED conference on Languages, compilers, and tools for embedded systems
Transform coding on programmable stream processors

The Journal of Supercomputing
Entering the petaflop era: the architecture and performance of Roadrunner

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Matrix-based streamization approach for improving locality and parallelism on FT64 stream processor

The Journal of Supercomputing
Compiler assisted architectural exploration framework for coarse grained reconfigurable arrays

The Journal of Supercomputing
Modulo scheduling without overlapped lifetimes

Proceedings of the 2009 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Exploiting the reuse supplied by loop-dependent stream references for stream processors

ACM Transactions on Architecture and Code Optimization (TACO)
Reuse-aware modulo scheduling for stream processors

Proceedings of the Conference on Design, Automation and Test in Europe

Quantified Score

Hi-index	0.00

Visualization

Abstract

Both reuse and concurrency are performance-critical for stream processors. When applying loop unrolling and software pipelining separately to stream-level loops, either reuse or concurrency or both may be inadequately exploited. In this paper, we optimize modulo scheduling to maximize stream reuse and improve concurrency for stream-level loops. The key insight is that an unrolled and software-pipelined stream-level loop could be described by a set of reuse equations. Guided by reuse equations, a reuse-aware modulo scheduling algorithm is developed to simultaneously optimize the two performance objectives, reuse, and concurrency, for a loop in a unified framework. Moreover, we describe a code generation algorithm to automatically produce the optimized loop from a given loop. The experimental results obtained on FT64 and by simulation demonstrate the effectiveness of the proposed approach.