Loop tiling for parallelism
Optimizing Supercompilers for Supercomputers
Optimizing Supercompilers for Supercomputers
Dependence graphs and compiler optimizations
POPL '81 Proceedings of the 8th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
High Performance Compilers for Parallel Computing
High Performance Compilers for Parallel Computing
Imagine: Media Processing with Streams
IEEE Micro
A Loop Transformation Theory and an Algorithm to Maximize Parallelism
IEEE Transactions on Parallel and Distributed Systems
Proceedings of the 30th annual international symposium on Computer architecture
Programmable Stream Processors
Computer
The vlsi implementation and evaluation of area- and energy-efficient streaming media processors
The vlsi implementation and evaluation of area- and energy-efficient streaming media processors
Memory hierarchy design for stream computing
Memory hierarchy design for stream computing
Scientific computing applications on the imagine stream processor
ACSAC'06 Proceedings of the 11th Asia-Pacific conference on Advances in Computer Systems Architecture
A 64-bit stream processor architecture for scientific applications
Proceedings of the 34th annual international symposium on Computer architecture
Implementation and evaluation of Jacobi iteration on the imagine stream processor
HiPC'07 Proceedings of the 14th international conference on High performance computing
Architecture-based optimization for mapping scientific applications to imagine
ISPA'07 Proceedings of the 5th international conference on Parallel and Distributed Processing and Applications
Hi-index | 0.00 |
Despite Imagine presents an efficient memory hierarchy, the straightforward programming of scientific applications does not match the available memory hierarchy and thereby constrains the performance of stream applications. In this paper, we explore a novel matrix-based programming optimization for improving the memory hierarchy performance to sustain the operands needed for highly parallel computation. Our specific contributions include that we formulate the problem on the Data&Computation Matrix (D&C Matrix) that is proposed to abstract the relationship between streams and kernels, and present the key techniques for improving the multilevel bandwidth utilization based on this matrix. The experimental evaluation on five representative scientific applications shows that the new stream programs yielded by our optimization can effectively enhance the locality in LRF and SRF, improve the capacity utilization of LRF and SRF, make the best use of SPs and SBs, and avoid index stream overhead.