The continuing importance of game applications and other numerically intensive workloads has generated an upsurge in novel computer architectures tailored for such functionality. Game applications feature highly parallel code for functions such as game physics, which have high computation and memory requirements, and scalar code for functions such as game artificial intelligence, for which fast response times and a full-featured programming environment are critical. The Cell Broadband Engine™ architecture targets such applications, providing both flexibility and high performance by utilizing a 64-bit multithreaded PowerPC® processor element (PPE) with two levels of globally coherent cache and eight synergistic processor elements (SPEs), each consisting of a processor designed for streaming workloads, a local memory, and a globally coherent DMA (direct memory access) engine. Growth in processor complexity is driving a parallel need for sophisticated compiler technology. In this paper, we present a variety of compiler techniques designed to exploit the performance potential of the SPEs and to enable the multilevel heterogeneous parallelism found in the Cell Broadband Engine architecture. Our goal in developing this compiler has been to enhance programmability while continuing to provide high performance. We review the Cell Broadband Engine architecture and present the results of our compiler techniques, including SPE optimization, automatic code generation, single source parallelization, and partitioning.