Dependence-based code generation for a CELL processor

Authors:
Yuan Zhao;Ken Kennedy
Affiliations:
Computer Science Department, Rice University, Houston, TX;Computer Science Department, Rice University, Houston, TX
Venue:
LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
Year:
2006

Abstract

Obtaining high performance on the STI CELL processor requires substantial programming effort because its architectural features must be explicitly managed, with separate codes required for two different types of cores (PPE and SPE). Research at IBM has developed a single source-image compiler for CELL that performs vectorization but uses OpenMP to specify cross-core parallelism. In this paper, we present and evaluate an alternative dependence-based compiler approach that automatically generates parallel and vector code for CELL from a single source program with no parallelism directives. In contrast to OpenMP, our approach can also handle loop nests that carry dependences. To preserve correct program semantics, we employ on-chip communication mechanisms to implement barrier and unidirectional synchronization primitives. We also implement strategies to boost performance by managing DMA data movement, improving data alignment, and exploiting memory reuse in the innermost loop.
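
The idea of running a dependence-carrying loop nest in parallel under unidirectional synchronization can be illustrated with a small sketch. This is not the authors' implementation: Python threads stand in for SPEs, a per-row semaphore stands in for whichever on-chip communication mechanism the compiler actually uses, and the names `sequential`, `pipelined`, `workers`, and `block` are hypothetical. The recurrence `a[i][j] = a[i-1][j] + a[i][j-1]` is an assumed example of a nest in which both loops carry a dependence.

```python
import threading


def sequential(n, m):
    """Reference version: both loops carry a dependence."""
    a = [[1] * m for _ in range(n)]
    for i in range(1, n):
        for j in range(1, m):
            a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a


def pipelined(n, m, workers=4, block=8):
    """Rows are dealt cyclically to workers (stand-ins for SPEs).

    The owner of row i-1 signals the owner of row i after each finished
    column block; the signal flows one way only (producer to consumer).
    """
    a = [[1] * m for _ in range(n)]
    # ready[i] counts the column blocks of row i that are complete.
    ready = [threading.Semaphore(0) for _ in range(n)]
    block_starts = list(range(1, m, block))

    def run(w):
        for i in range(1 + w, n, workers):
            for j0 in block_starts:
                if i > 1:
                    # Wait until the same block of row i-1 is done.
                    ready[i - 1].acquire()
                for j in range(j0, min(j0 + block, m)):
                    a[i][j] = a[i - 1][j] + a[i][j - 1]
                # Tell the owner of row i+1 that this block is available.
                ready[i].release()

    threads = [threading.Thread(target=run, args=(w,)) for w in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a


if __name__ == "__main__":
    print(pipelined(12, 40) == sequential(12, 40))
```

A larger `block` means fewer synchronization events but a longer pipeline fill time, which is the kind of trade-off a code generator for such loops has to weigh. A barrier, by contrast, would make every worker wait for all others, which is only needed when no such point-to-point ordering suffices.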