Optimizing two-dimensional DMA transfers for scratchpad Based MPSoCs platforms

Authors:
Selma Saidi;Pranav Tendulkar;Thierry Lepley;Oded Maler
Affiliations:
-;-;-;-
Venue:
Microprocessors & Microsystems
Year:
2013

Citing 12
Cited 0

Parallel image processing applications on a network of workstations

Parallel Computing
Automatic Partitioning of Parallel Loops and Data Arrays for Distributed Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Introduction to the cell multiprocessor

IBM Journal of Research and Development - POWER5 and packaging
Cell Multiprocessor Communication Network: Built for Speed

IEEE Micro
Sequoia: programming the memory hierarchy

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Quantifying the potential benefit of overlapping communication and computation in large-scale scientific applications

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Sequoia: programming the memory hierarchy

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
The cell broadband engine: exploiting multiple levels of parallelism in a chip multiprocessor

International Journal of Parallel Programming
A comparison of programming models for multiprocessors with explicitly managed memory hierarchies

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Programming Multiprocessors with Explicitly Managed Memory Hierarchies

Computer
Optimizing explicit data transfers for data parallel applications on the cell architecture

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Platform 2012, a many-core computing accelerator for embedded SoCs: performance evaluation of visual analytics applications

Proceedings of the 49th Annual Design Automation Conference

Quantified Score

Hi-index	0.00

Visualization

Abstract

Reducing the effects of off-chip memory access latency is a key factor in exploiting efficiently embedded multi-core platforms. We consider architectures that admit a multi-core computation fabric, having its own fast and small memory to which the data blocks to be processed are fetched from external memory using a DMA (direct memory access) engine, employing a double- or multiple-buffering scheme to avoid processor idling. In this paper we focus on application programs that process two-dimensional data arrays and we determine automatically the size and shape of the portions of the data array which are subject to a single DMA call, based on hardware and applications parameters. When the computation on different array elements are completely independent, the asymmetry of memory structure leads always to prefer one-dimensional horizontal pieces of memory, while when the computation of a data element shares some data with its neighbors, there is a pressure for more ''square'' shapes to reduce the amount of redundant data transfers. We provide an analytic model for this optimization problem and validate our results by running a mean filter application on the Cell simulator.