Adaptive latency-aware parallel resource mapping: task graph scheduling onto heterogeneous network topology

  • Authors: Liwen Shih
  • Affiliation: Computer Engineering, Houston, TX
  • Venue: Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery
  • Year: 2013

Abstract

Given a graph pair, an acyclic task data-flow graph (DFG) and a processor network topology graph with two-way communication channels, the latency-adaptive A* parallel resource mapping produces an efficient task execution schedule that can also be used to quantify the quality of a parallel software/hardware match. The network-latency-adaptive parallel mapping framework, from static task DFG to parallel processor network topology graph, is aimed at automatically optimizing workflow task scheduling among computation cluster nodes or subnets, including CPUs, multicores, VLIW processors, and co-processor accelerators such as GPUs, DSPs, and FPGA fabric blocks. The latency-adaptive parallel mapper starts scheduling by assigning the highest-priority task to a centrally located, capable processor in the network topology, and then conservatively assigns additional nearby, capable network processor cores only as needed to improve computation efficiency, scheduling the fewest yet sufficient processors. For slower communication with high inter/intra-processor latency ratios, the adaptive parallel mapper automatically opts for fewer processor cores, or even schedules just a single sequential processor, over parallel processing. The examples tested on a simulated adaptive mapper demonstrate that latency-adaptive parallel resource mapping achieves better cost-efficiency than fixed task-to-processor mapping, with nearly optimal speedup using fewer, nearby processors, and with only one or no processor/switch hop in around 90% of the data transfers. Conversely, for faster networks, more processors are scheduled automatically due to the lower inter-processor latency. In extreme cases, where offloading the next task to another processor may be faster than waiting for a processor to finish its current task (i.e., when the inter/intra-processor latency ratio falls below 1), pipeline processing can outperform parallel processing, offering a surprising bonus in this parallel resource mapping study.
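To make the scheduling policy concrete, here is a minimal, hypothetical Python sketch of the latency-adaptive idea. It is not the paper's A* mapper: it substitutes a simple greedy list scheduler, assumes a linear-array topology where communication cost grows with hop distance, and every name in it (the `adaptive_map` function, the toy DFG, the costs) is an illustrative invention. It does exhibit the behavior the abstract describes: a new nearby processor is recruited only when it actually improves a task's finish time, so a high inter/intra-processor latency ratio collapses the schedule onto fewer cores, while a low ratio spreads work out.

```python
# Hypothetical sketch of latency-adaptive mapping (greedy, not A*).
from collections import defaultdict

def adaptive_map(tasks, preds, num_procs, inter_latency):
    """tasks: [(name, compute_cost)] in topological order.
    preds: {task: [(pred_task, data_units)]}.
    Assumed linear-array topology: sending one data unit between
    processors p and q costs inter_latency * |p - q| (zero on-chip)."""
    center = num_procs // 2          # start at a central processor
    used = {center}                  # conservatively grow this set
    proc_free = defaultdict(float)   # time each processor becomes free
    placed = {}                      # task -> (processor, finish_time)

    for name, cost in tasks:
        # Candidates: processors already in use, plus the nearest idle one.
        idle = sorted(set(range(num_procs)) - used,
                      key=lambda p: min(abs(p - u) for u in used))
        candidates = sorted(used) + idle[:1]
        best = None
        for p in candidates:
            # Earliest start: processor is free and all input data arrived.
            ready = proc_free[p]
            for pred, data in preds.get(name, []):
                pp, pf = placed[pred]
                ready = max(ready, pf + inter_latency * data * abs(pp - p))
            finish = ready + cost
            # Strict '<' keeps the processor count minimal on ties.
            if best is None or finish < best[1]:
                best = (p, finish)
        p, finish = best
        used.add(p)
        proc_free[p] = finish
        placed[name] = (p, finish)
    return placed

# Toy fork-join DFG: A feeds B and C, which join at D.
dfg = [("A", 2.0), ("B", 4.0), ("C", 4.0), ("D", 1.0)]
deps = {"B": [("A", 1)], "C": [("A", 1)], "D": [("B", 1), ("C", 1)]}
for lat in (0.1, 10.0):
    print(lat, adaptive_map(dfg, deps, num_procs=8, inter_latency=lat))
```

With the fast network (`inter_latency=0.1`) the sketch places B and C on two adjacent processors; with the slow one (`inter_latency=10.0`) it keeps the entire DFG on a single processor, mirroring the fewer-cores-under-high-latency adaptation reported in the abstract.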