Application task and data placement in embedded many-core NUMA architectures

  • Authors:
  • Karl Viring (Seoul National University, Korea); Sangheon Lee, Yeongon Cho, Soojung Ryu (Samsung Advanced Institute of Technology, Korea); Bernhard Egger (Seoul National University, Korea)

  • Venue:
  • Proceedings of the 10th Workshop on Optimizations for DSP and Embedded Systems
  • Year:
  • 2013


Abstract

The advent of many-core chips in the embedded world poses new challenges to programmers. One of the most important challenges for achieving optimal performance is that the variance in memory access time, which depends on the issuing core and the location of the data, must be taken into consideration in the already complex environment of parallel applications. In this paper we seek efficient ways of implementing dynamic task and data placement for embedded many-core systems that exhibit non-uniform memory access (NUMA). A high affinity between a task and its data is desirable to minimize memory access cost and program execution time. This research was carried out on a many-core MPSoC FPGA prototype from Samsung comprising 16 reconfigurable processors, global SDRAM, and distributed scratchpad memory. Since neither a message-passing interface nor APIs such as OpenMP are available for the Samsung 16-SRP, this paper focuses on optimizations in the application layer. To gain insight into memory performance, we carried out a series of microbenchmarks aimed at discovering NUMA behavior, memory hierarchy latencies, architecture symmetry, DMA-transfer overhead, and congestion. Accessing the SDRAM exhibits NUMA factors of up to 1.45; however, the most important factor for achieving high performance is minimizing congestion in the NoC and at the memory controllers. We applied these findings to two real-world imaging applications, image blurring and Seeded Region Growing (SRG). The base cases for both applications are parallelized for 16 cores. For image blurring, we achieve a 2.75- to 3-fold speedup by distributing the image data over the four SDRAMs and statically assigning the work items to the cores closest to the data. Depending on the input image, the SRG application shows a 1.4- to 1.5-fold speedup over the base case by applying dynamic workload balancing using hierarchical queues and work-stealing.
The results emphasize the importance of considering the underlying structure of the memory architecture in order to achieve high performance on embedded many-core chips. Our long-term goal is to implement automatic workload distribution in a parallelization framework such as OpenCL.