Fine-grain access control for distributed shared memory
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
A Loop Transformation Algorithm for Communication Overlapping
International Journal of Parallel Programming - Special issue on international symposium on high performance computing 1997, part I
A synthesis of memory mechanisms for distributed architectures
ICS '01 Proceedings of the 15th international conference on Supercomputing
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Removing the overhead from software-based shared memory
Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Scratchpad memory: design alternative for cache on-chip memory in embedded systems
Proceedings of the tenth international symposium on Hardware/software codesign
CRL: High Performance All-Software Distributed Shared Memory
CRL: High Performance All-Software Distributed Shared Memory
Communication Optimizations for Fine-Grained UPC Applications
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Chip multiprocessing and the cell broadband engine
Proceedings of the 3rd conference on Computing frontiers
Sequoia: programming the memory hierarchy
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Sequoia: programming the memory hierarchy
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Compilation for explicitly managed memory hierarchies
Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
Prefetching irregular references for software cache on cell
Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Orchestrating data transfer for the cell/B.E. processor
Proceedings of the 22nd annual international conference on Supercomputing
Hybrid access-specific software cache techniques for the cell BE architecture
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
COMIC: a coherent shared memory interface for cell be
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
A comparison of programming models for multiprocessors with explicitly managed memory hierarchies
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Optimizing the use of static buffers for DMA on a CELL chip
LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
An OpenCL framework for heterogeneous multicores with local memory
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
DMATiler: revisiting loop tiling for direct memory access
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
TL-DAE: thread-level decoupled access/execution for OpenMP on the cyclops-64 many-core processor
LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
Proceedings of the 9th conference on Computing Frontiers
Hi-index | 0.00 |
In heterogeneous multi-core systems, such as the Cell BE or certain embedded systems, the accelerator core has its own fast local memory without hardware supported coherence. It is software's responsibility to dynamically transfer the working set when the total data set is too large to fit in the local memory. The data can be transferred through a software controlled cache which maintains correctness and exploits reuse among references, especially when complicated aliasing or data dependence exists. However, the software controlled cache introduces the extra overhead of cache lookup. In this paper we present the design and implementation of a Direct Blocking Data Buffer (DBDB) which combines compiler analysis and runtime management to optimize local memory utilization. We use compile time analysis to identify regular references in a loop body, block the innermost loop according to the access patterns and available local memory space, insert DMA operations for the blocked loop, and substitute references to local buffers. The runtime is responsible for allocating local memory for DBDB, especially for disambiguating aliased memory accesses which could not be resolved at compile time. We further optimize noncontiguous references by taking advantage of the DMA-list feature provided by the Cell BE. A practical performance model is presented to guide the DMA transfer scheme selection among single-DMA, multi-DMA and DMA-list. We have implemented DBDB in the IBM XL C/C++ for Multicore Acceleration for Linux, and have conducted experiments with selected test cases from the NAS OpenMP and SPEC benchmarks. The results show that our method performs well compared with traditional software cache approach. We have observed a speedup of up to 5.3x and 4x in average.