Given a graph pair, an acyclic task data-flow graph (DFG) and a processor network topology graph with two-way communication channels, the latency-adaptive A* parallel resource mapping produces an efficient task execution schedule that can also be used to quantify the quality of a parallel software/hardware match. The network-latency-adaptive parallel mapping framework, from static task DFG to parallel processor network topology graph, is aimed at automatically optimizing workflow task scheduling among computation cluster nodes or subnets, including CPU, multicore, VLIW, and co-processor accelerators such as GPUs, DSPs, and FPGA fabric blocks. The latency-adaptive parallel mapper starts scheduling by assigning the highest-priority task to a centrally located, capable processor in the network topology, and then conservatively assigns additional nearby, capable network processor cores only as needed to improve computation efficiency with the fewest, yet sufficient, processors scheduled. For slower communication with high inter/intra-processor latency ratios, the adaptive parallel mapper automatically opts for fewer processor cores, or even schedules just a single sequential processor, over parallel processing. The examples tested on a simulated adaptive mapper demonstrate that latency-adaptive parallel resource mapping achieves better cost-efficiency than fixed task-to-processor mapping, reaching nearly optimal speedup with fewer, nearby processors and incurring at most one processor/switch hop in around 90% of the data transfers. Conversely, for faster networks, more processors are scheduled automatically owing to lower inter-processor latency. In extreme cases, where offloading the next task to another processor may be faster than waiting for a processor to finish its current task (i.e., when the inter/intra-processor latency ratio is very low), pipeline processing can outperform parallel processing, offering a surprising bonus in this parallel resource mapping study.
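The core idea of the abstract, conservatively opening an additional nearby processor only when offloading beats queueing on an already-busy core, can be sketched as a small greedy scheduler. This is a minimal illustrative sketch, not the paper's actual A* algorithm: the function name, the uniform per-hop latency model, and the list-of-processors representation are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of a latency-adaptive DFG-to-processor mapper.
# Tasks arrive in topological (priority) order; a new processor is opened
# only when starting there (paying one communication hop for remote
# predecessor data) strictly beats waiting on an existing processor.
def adaptive_map(tasks, deps, hop_latency, task_cost):
    """Return {task: (processor, finish_time)} for a greedy schedule.

    tasks: task ids in topological/priority order
    deps:  dict task -> list of predecessor tasks
    hop_latency: inter-processor transfer cost (intra-processor cost is 0)
    task_cost: dict task -> execution time
    """
    proc_free = [0.0]  # processor 0 plays the "centrally located" core
    schedule = {}
    for t in tasks:
        preds = deps.get(t, [])
        ready = max((schedule[p][1] for p in preds), default=0.0)
        best = None
        for proc, free in enumerate(proc_free):
            # predecessors mapped to other processors pay one hop
            comm = max((hop_latency for p in preds
                        if schedule[p][0] != proc), default=0.0)
            start = max(free, ready + comm)
            if best is None or start < best[1]:
                best = (proc, start)
        # conservatively open a fresh processor only if it strictly helps
        new_start = ready + (hop_latency if preds else 0.0)
        if new_start < best[1]:
            proc_free.append(0.0)
            best = (len(proc_free) - 1, new_start)
        proc, start = best
        finish = start + task_cost[t]
        proc_free[proc] = finish
        schedule[t] = (proc, finish)
    return schedule
```

Under this toy cost model the latency-adaptive behavior described in the abstract falls out directly: with a high per-hop latency a fan-out of dependent tasks stays on one sequential processor, while with near-zero latency the same DFG spreads across additional cores for a shorter makespan.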