Optimizing explicit data transfers for data parallel applications on the cell architecture

Authors:
Selma Saidi;Pranav Tendulkar;Thierry Lepley;Oded Maler
Affiliations:
Verimag Lab, University of Grenoble and STMicroelectronics Grenoble, France;Verimag Lab, University of Grenoble, France;STMicroelectronics Grenoble, France;CNRS-Verimag Lab, Grenoble, France
Venue:
ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Year:
2012

Citing 27
Cited 1

A model for hierarchical memory

STOC '87 Proceedings of the nineteenth annual ACM symposium on Theory of computing
More iteration space tiling

Proceedings of the 1989 ACM/IEEE conference on Supercomputing
A bridging model for parallel computation

Communications of the ACM
Software prefetching

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Tolerating latency through software-controlled prefetching in shared-memory multiprocessors

Journal of Parallel and Distributed Computing - Special issue on shared-memory multiprocessors
LogP: towards a realistic model of parallel computation

PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
A performance study of software and hardware data prefetching schemes

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Tile size selection using cache organization and data layout

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Effective Hardware-Based Data Prefetching for High-Performance Processors

IEEE Transactions on Computers
Sequential Hardware Prefetching in Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Automatic Partitioning of Parallel Loops and Data Arrays for Distributed Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Guided region prefetching: a cooperative hardware/software approach

Proceedings of the 30th annual international symposium on Computer architecture
Key features of the design methodology enabling a multi-core SoC implementation of a first-generation CELL processor

ASP-DAC '06 Proceedings of the 2006 Asia and South Pacific Design Automation Conference
ROS-DMA: A DMA Double Buffering Method for Embedded Image Processing with Resource Optimized Slicing

RTAS '06 Proceedings of the 12th IEEE Real-Time and Embedded Technology and Applications Symposium
Cell Multiprocessor Communication Network: Built for Speed

IEEE Micro
Sequoia: programming the memory hierarchy

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Quantifying the potential benefit of overlapping communication and computation in large-scale scientific applications

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
The cell broadband engine: exploiting multiple levels of parallelism in a chip multiprocessor

International Journal of Parallel Programming
SPENK: adding another level of parallelism on the cell broadband engine

IFMT '08 Proceedings of the 1st international forum on Next-generation multicore/manycore technologies
A comparison of programming models for multiprocessors with explicitly managed memory hierarchies

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Optimizing assignment of threads to SPEs on the cell BE processor

IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing
Programming Multiprocessors with Explicitly Managed Memory Hierarchies

Computer
Optimizing the use of static buffers for DMA on a CELL chip

LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
Modeling multigrain parallelism on heterogeneous multi-core processors: a case study of the cell BE

HiPEAC'08 Proceedings of the 3rd international conference on High performance embedded architectures and compilers
Buffer sizing for self-timed stream programs on heterogeneous distributed memory multiprocessors

HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers

Optimizing two-dimensional DMA transfers for scratchpad Based MPSoCs platforms

Microprocessors & Microsystems

Abstract

In this paper, we investigate a general approach to automating some deployment decisions for a certain class of applications on multi-core computers. We consider data-parallelizable programs that use the well-known double buffering technique to bring data from the slow off-chip memory to the local memory of the cores via a DMA (direct memory access) mechanism. Based on the computation time and size of elementary data items, as well as DMA characteristics, we derive optimal and near-optimal values for the number of blocks that should be clustered in a single DMA command. We then extend the results to the case where the computation for one data item needs some data in its neighborhood. In this setting, we characterize the performance of several alternative mechanisms for data sharing. Our models are validated experimentally using a cycle-accurate simulator of the Cell Broadband Engine architecture.
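
The trade-off described in the abstract can be illustrated with a small sketch. It is not the paper's model: it assumes an affine DMA cost (a fixed setup cost plus a per-byte cost), a constant computation time per block, and a single core using two buffers, so that each transfer overlaps with the computation on the previous batch. All function names, parameters, and numbers below are hypothetical and chosen only for illustration.

```python
def transfer_time(s, block_bytes, dma_setup, cost_per_byte):
    """Assumed affine cost of one DMA command that moves s blocks."""
    return dma_setup + cost_per_byte * block_bytes * s


def compute_time(s, compute_per_block):
    """Assumed time to process s blocks already held in local memory."""
    return compute_per_block * s


def double_buffered_time(n, s, block_bytes, dma_setup, cost_per_byte,
                         compute_per_block):
    """Estimate the total time to process n blocks, fetched s at a time.

    Simplification: every batch is treated as a full batch of s blocks,
    even when s does not divide n.
    """
    steps = -(-n // s)  # ceiling of n / s
    t = transfer_time(s, block_bytes, dma_setup, cost_per_byte)
    c = compute_time(s, compute_per_block)
    # The first fetch and the last computation cannot be overlapped;
    # each of the steps - 1 stages in between costs the slower activity.
    return t + (steps - 1) * max(t, c) + c


def best_granularity(n, max_s, block_bytes, dma_setup, cost_per_byte,
                     compute_per_block):
    """Exhaustively pick the batch size s in 1..max_s with the lowest estimate."""
    return min(
        range(1, max_s + 1),
        key=lambda s: double_buffered_time(
            n, s, block_bytes, dma_setup, cost_per_byte, compute_per_block),
    )


if __name__ == "__main__":
    # Hypothetical platform numbers, in cycles and bytes.
    params = dict(block_bytes=16, dma_setup=400, cost_per_byte=0.5,
                  compute_per_block=20)
    s_best = best_granularity(n=1024, max_s=128, **params)
    print("chosen blocks per DMA command:", s_best)
    print("estimated cycles:", double_buffered_time(1024, s_best, **params))
```

Under these assumptions, small batches are dominated by the fixed DMA setup cost, while large batches lengthen the non-overlapped first fetch and last computation, so the estimate is lowest at an intermediate batch size. The upper bound `max_s` stands in for the local memory capacity, which limits how many blocks two buffers can hold.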