Direct addressed caches for reduced power consumption
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Optimizing Compiler for the CELL Processor
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
POWER5 System microarchitecture
IBM Journal of Research and Development - POWER5 and packaging
Dynamic allocation for scratch-pad memory using compile-time decisions
ACM Transactions on Embedded Computing Systems (TECS)
Software-based instruction caching for embedded processors
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Prefetching irregular references for software cache on cell
Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Orchestrating data transfer for the cell/B.E. processor
Proceedings of the 22nd annual international conference on Supercomputing
Optimizing the use of static buffers for DMA on a CELL chip
LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
IWOMP'05/IWOMP'06 Proceedings of the 2005 and 2006 international conference on OpenMP shared memory parallel programming
DBDB: optimizing DMATransfer for the cell be architecture
Proceedings of the 23rd international conference on Supercomputing
An efficient software cache for H.264 motion compensation
SOC'09 Proceedings of the 11th international conference on System-on-chip
An OpenCL framework for heterogeneous multicores with local memory
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Adaptive line size cache for irregular references on cell multicore processor
NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
Optimization of FDTD computations in a streaming model architecture
PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I
Euro-Par'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II
Compiler-directed memory management for heterogeneous MPSoCs
Journal of Systems Architecture: the EUROMICRO Journal
DDM-VMc: the data-driven multithreading virtual machine for the cell processor
Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
An instruction to accelerate software caches
ARCS'11 Proceedings of the 24th international conference on Architecture of computing systems
OpenMP extensions for heterogeneous architectures
IWOMP'11 Proceedings of the 7th international conference on OpenMP in the Petascale era
LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Automatic data distribution for improving data locality on the cell BE architecture
LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Streaming model computation of the FDTD problem
PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume Part I
Proceedings of the 9th conference on Computing Frontiers
Integrating software caches with scratch pad memory
Proceedings of the 2012 international conference on Compilers, architectures and synthesis for embedded systems
A transactional runtime system for the Cell/BE architecture
Journal of Parallel and Distributed Computing
Hardware-software coherence protocol for the coexistence of caches and local memories
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A Multidimensional Software Cache for Scratchpad-Based Systems
International Journal of Embedded and Real-Time Communication Systems
SemCache: semantics-aware caching for efficient GPU offloading
Proceedings of the 27th international ACM conference on International conference on supercomputing
Hi-index | 0.00 |
Ease of programming is one of the main impediments for the broad acceptance of multi-core systems with no hardware support for transparent data transfer between local and global memories. Software cache is a robust approach to provide the user with a transparent view of the memory architecture; but this software approach can suffer from poor performance. In this paper, we propose a hierarchical, hybrid software-cache architecture that classifies at compile time memory accesses in two classes, high-locality and irregular. Our approach then steers the memory references toward one of two specific cache structures optimized for their respective access pattern. The specific cache structures are optimized to enable high-level compiler optimizations to aggressively unroll loops, reorder cache references, and/or transform surrounding loops so as to practically eliminate the software cache overhead in the innermost loop. Performance evaluation indicates that improvements due to the optimized software-cache structures combined with the proposed code-optimizations translate into 3.5 to 8.4 speedup factors, compared to a traditional software cache approach. As a result, we demonstrate that the Cell BE processor can be a competitive alternative to a modern server-class multi-core such as the IBM Power5 processor for a set of parallel NAS applications.