The advent of many-core chips in the embedded world imposes new challenges on programmers. One of the most important challenges in achieving optimal performance is that memory access time varies with the issuing core and the location of the data, and this variance must be taken into account in the already complex environment of parallel applications. In this paper we seek efficient ways of implementing dynamic task and data placement for embedded many-core systems that exhibit non-uniform memory access (NUMA). Achieving high affinity between a task and its data is desirable to minimize memory access cost and program execution time. This research was carried out on a many-core MPSoC FPGA prototype from Samsung. The MPSoC comprises 16 reconfigurable processors, global SDRAM, and distributed scratchpad memory. Since no message passing interface or APIs such as OpenMP are available for the Samsung 16-SRP, this paper focuses on optimizations in the application layer. To gain insight into memory performance, we carried out a series of microbenchmarks. These benchmarks aim at discovering NUMA behavior, memory hierarchy latencies, architecture symmetry, DMA-transfer overhead, and congestion. Accessing the SDRAM exhibits NUMA factors of up to 1.45; however, the most important factor for achieving high performance is minimizing congestion in the NoC and at the memory controllers. We have applied these findings to two real-world imaging applications, image blurring and Seeded Region Growing (SRG). The base cases for both applications are parallelized for 16 cores. For image blurring, we achieve a 2.75- to 3-fold speedup by distributing the image data across the four SDRAMs and statically assigning the work items to the cores closest to the data. Depending on the input image, the SRG application shows a 1.4- to 1.5-fold speedup over the base case by applying dynamic workload balancing using hierarchical queues and work-stealing.
The results emphasize the importance of taking the underlying memory architecture into account in order to achieve high performance on embedded many-core chips. Our long-term goal is to implement automatic workload distribution in a parallelization framework such as OpenCL.