Developed for multimedia and game applications, as well as other numerically intensive workloads, the CELL processor supports both highly parallel codes, which have high computation and memory requirements, and scalar codes, which require fast response times and a full-featured programming environment. The first-generation CELL processor implements on a single chip a Power Architecture processor with two levels of cache and eight attached streaming processors, each with its own local memory and a globally coherent DMA engine. In addition to processor-level parallelism, each processing element has a Single Instruction Multiple Data (SIMD) unit that can process anywhere from 2 double-precision floating-point values to 16 bytes per instruction. This paper describes, in the context of a research prototype, several compiler techniques that aim to automatically generate high-quality code across the wide range of heterogeneous parallelism available on the CELL processor. The techniques include compiler-supported branch prediction, compiler-assisted instruction fetch, generation of scalar code on SIMD units, automatic generation of SIMD code, and data and code partitioning across the multiple processor elements in the system. Results indicate that significant speedup can be achieved with a high level of support from the compiler.
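As a rough illustration of the SIMD width described above, the following Python sketch computes how many elements of a given type fit in one vector operation. It assumes a 16-byte vector, which follows from the figures in the text (2 double-precision values of 8 bytes each, or 16 single bytes). The names `VECTOR_BYTES`, `lanes`, and `simd_add` are hypothetical and introduced only for this example; the float and short entries are added for illustration and are not taken from the text. The sketch emulates lane-wise behaviour in plain Python and is not the compiler's generated code.

```python
import struct

# Assumption derived from the text: 2 doubles x 8 bytes = 16 bytes per vector.
VECTOR_BYTES = 16

# struct format codes for a few element types (illustrative selection).
ELEMENT_FORMATS = {"double": "d", "float": "f", "short": "h", "byte": "b"}


def lanes(element_type):
    """Return how many elements of this type fit in one 16-byte vector."""
    element_bytes = struct.calcsize("<" + ELEMENT_FORMATS[element_type])
    return VECTOR_BYTES // element_bytes


def simd_add(a, b, element_type):
    """Emulate a single SIMD add: all lanes are combined in one step."""
    n = lanes(element_type)
    if len(a) != n or len(b) != n:
        raise ValueError(f"expected {n} {element_type} lanes per operand")
    return [x + y for x, y in zip(a, b)]


if __name__ == "__main__":
    for name in ELEMENT_FORMATS:
        print(name, lanes(name))
```

Under these assumptions, a double-precision add handles 2 elements per instruction and a byte add handles 16, which is why narrower data types can benefit more from automatic SIMD code generation.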