Parallel programming patterns for multi-processor SoC: Application to video processing

  • Authors:
  • Pierre G. Paulin;Ali Erdem Özcan;Vincent Gagné;Bruno Lavigueur;Olivier Benny

  • Affiliations:
  • STMicroelectronics Inc., Ottawa, Canada;STMicroelectronics Inc., Ottawa, Canada;STMicroelectronics Inc., Ottawa, Canada;STMicroelectronics Inc., Ottawa, Canada;STMicroelectronics Inc., Ottawa, Canada

  • Venue:
  • ACM Transactions on Embedded Computing Systems (TECS) - Special section on ESTIMedia'12, LCTES'11, rigorous embedded systems design, and multiprocessor system-on-chip for cyber-physical systems
  • Year:
  • 2013

Abstract

Efficient, scalable, and productive parallel programming is a major challenge for exploiting future multi-processor SoC platforms. This article presents the MultiFlex programming environment, which has been developed to address this challenge. It targets Platform 2012, a scalable multi-processor fabric. The MultiFlex environment supports high-level simulation and iterative platform mapping, and includes tools for programming-model-aware debug, trace, visualization, and analysis. This article focuses on the two classes of programming abstractions supported in MultiFlex. The first is a set of Parallel Programming Patterns (PPP), which offer a rich set of programming abstractions for implementing efficient data- and task-level parallel applications. The second is a Reactive Task Management (RTM) abstraction, which offers a lightweight C-based API to support dynamic dispatching of small-grain tasks on tightly coupled parallel processing resources. The use of the MultiFlex native programming model is illustrated through the capture and mapping of two representative video applications. The first is a high-quality rescaling (HQR) application on a multi-processor platform. We present the details of the optimization process that was required for mapping the HQR application, for which the reference code requires 350 GIPS (giga instructions per second), onto a 16-processor cluster. Our results show that the parallel implementation using the PPP model offers almost linear acceleration with respect to the number of processing elements. The second application is a high-definition VC-1 decoder. For this application, we illustrate two different parallel programming model variants, one using PPPs, the other based on RTM. These two versions are mapped onto two variants of a homogeneous version of the Platform 2012 multi-core fabric.
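
The idea of dynamically dispatching small-grain tasks onto a fixed pool of processing resources, as described for RTM above, can be sketched roughly as follows. This is an illustrative sketch only and not the MultiFlex or RTM API (which the abstract describes as C-based): the names `rescale_row` and `dispatch`, the choice of one image row per task, and the nearest-neighbour filter are all assumptions made for the example. Idle workers pull the next pending task from a shared queue, so the load balances itself without a static assignment of rows to workers.

```python
import queue
import threading


def rescale_row(row, out_width):
    """Nearest-neighbour horizontal rescale of one row.

    Hypothetical stand-in for a real rescaling kernel.
    """
    in_width = len(row)
    return [row[(x * in_width) // out_width] for x in range(out_width)]


def dispatch(frame, out_width, num_workers=4):
    """Rescale every row of `frame`, one small task per row.

    Hypothetical helper: tasks are dispatched dynamically, meaning each
    worker fetches a new row from the shared queue whenever it is idle.
    """
    tasks = queue.Queue()
    for index, row in enumerate(frame):
        tasks.put((index, row))

    result = [None] * len(frame)

    def worker():
        while True:
            try:
                index, row = tasks.get_nowait()
            except queue.Empty:
                return  # no pending tasks left, worker terminates
            # Each task writes to its own slot, so no lock is needed here.
            result[index] = rescale_row(row, out_width)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

Calling `dispatch(frame, out_width)` returns the rescaled rows in their original order regardless of which worker processed which row. In CPython, threads do not speed up CPU-bound work because of the global interpreter lock, so the sketch shows only the dispatch pattern, not the near-linear acceleration reported in the article.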