In heterogeneous multi-core systems, such as the Cell/B.E. or certain embedded systems, each accelerator core has its own fast local memory, with no hardware-supported coherence between the local and global memories. It is software's responsibility to dynamically transfer the working set into local memory when the total data set is too large to fit there. The data can be transferred through either a software-controlled cache or a direct buffer. A software cache maintains correctness and exploits reuse among references, especially in the presence of complicated aliasing or data dependences, but it introduces the extra overhead of cache lookup on every access. Direct buffering, on the other hand, is fast but is limited by the compiler's ability to disambiguate memory references. It is therefore desirable to use both methods judiciously, for irregular and regular accesses respectively. However, when a datum resides in both the software cache and a direct buffer, coherence problems arise. In this paper, we propose a solution that combines compile-time analysis with runtime maintenance to address this coherence issue. Compiler analysis guarantees that there is no software-cache access within the local live range of a direct buffer, and runtime support updates values between the software cache and the buffer at the buffer's entry and exit. Further, we present a global data-flow analysis that eliminates redundant coherence maintenance, and we overlap computation with DMA transfers to reduce runtime overhead. We have implemented this method in our Single Source Compiler for Cell and evaluated it with the NAS OpenMP benchmarks. The results show that our method maintains correctness while preserving most of the opportunities for direct buffering. Execution performance improves by more than 3x compared to approaches using only the software cache. Furthermore, compile-time analysis eliminates 90% of the runtime updates, yielding a further 20% performance improvement.