Automatic decomposition of scientific programs for parallel execution
POPL '87 Proceedings of the 14th ACM SIGACT-SIGPLAN symposium on Principles of programming languages
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
IEEE Transactions on Computers
Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Tolerating latency through software-controlled data prefetching
Tolerating latency through software-controlled data prefetching
Exploiting superword level parallelism with multimedia instruction sets
PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Optimizing compilers for modern architectures: a dependence-based approach
Optimizing compilers for modern architectures: a dependence-based approach
Automatic intra-register vectorization for the Intel architecture
International Journal of Parallel Programming
Compiler-Controlled Caching in Superword Register Files for Multimedia Extension Architectures
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Dependence analysis for subscripted variables and its application to program transformations
Dependence analysis for subscripted variables and its application to program transformations
Vectorization for SIMD architectures with alignment constraints
Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation
Optimizing Compiler for the CELL Processor
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Multi-platform Auto-vectorization
Proceedings of the International Symposium on Code Generation and Optimization
Auto-vectorization of interleaved data for SIMD
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Scalarization on Short Vector Machines
ISPASS '05 Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2005
Applying data copy to improve memory performance of general array computations
LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Implementation of OpenMP Work-Sharing on the Cell Broadband Engine Architecture
IWOMP '07 Proceedings of the 3rd international workshop on OpenMP: A Practical Programming Model for the Multi-Core Era
Scheduling dynamic parallelism on accelerators
Proceedings of the 6th ACM conference on Computing frontiers
Implementation of a wide-angle lens distortion correction algorithm on the cell broadband engine
Proceedings of the 23rd international conference on Supercomputing
PDCS '07 Proceedings of the 19th IASTED International Conference on Parallel and Distributed Computing and Systems
Automatic data distribution for improving data locality on the cell BE architecture
LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Hi-index | 0.00 |
Obtaining high performance on the STI CELL processor requires substantial programming effort because its architectural features must be explicitly managed, with separate codes required for two different types of cores (PPE and SPE). Research at IBM has developed a single source-image compiler for CELL that performs vectorization but uses OpenMP to specify cross-core parallelism. In this paper, we present and evaluate an alternative dependence-based compiler approach that automatically generates parallel and vector code for CELL from a single source program with no parallelism directives. In contrast to OpenMP, our approach can also handle loop nests that carry dependences. To preserve correct program semantics, we employ on-chip communication mechanisms to implement barrier and unidirectional synchronization primitives. We also implement strategies to boost performance by managing DMA data movement, improving data alignment, and exploiting memory reuse in the innermost loop.