Seamlessly portable applications: Managing the diversity of modern heterogeneous systems

  • Authors:
  • Mario Kicherer;Fabian Nowak;Rainer Buchty;Wolfgang Karl

  • Affiliations:
  • Karlsruhe Institute of Technology, Germany;Karlsruhe Institute of Technology, Germany;Karlsruhe Institute of Technology, Germany;Karlsruhe Institute of Technology, Germany

  • Venue:
  • ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
  • Year:
  • 2012

Abstract

Many configurations of heterogeneous systems exist today, which poses new challenges for application development: different types of processing units usually require individual programming models with dedicated runtime systems and accompanying libraries. If these are absent on an end-user system, e.g., because the respective hardware is not present, an application linked against them will break. This hampers the portability of applications that are developed on one system and executed on other, differently configured heterogeneous systems. Moreover, the benefit of each individual processing unit is normally not known in advance. In this work, we propose a technique to effectively decouple applications from their accelerator-specific code. This code is linked only on demand, which makes an application portable across systems with different accelerators. As there are usually multiple hardware-specific implementations of a given task, e.g., a CPU and a GPU version, a method is required to determine which of them are usable at all and which one is most suitable for execution on the current system. With our approach, application and hardware programmers can express the requirements and the abilities of the application and of the hardware-specific implementations in a simplified manner. At runtime, the requirements and abilities are compared with regard to the present hardware in order to determine the usable implementations of a task. If multiple implementations are usable, an online-learning, history-based selector is employed to determine the most efficient one. We show that our approach dynamically chooses the fastest usable implementation on several systems while itself introducing only negligible overhead. Applied to an MPI application, our mechanism enables the exploitation of local accelerators on different heterogeneous hosts without prior knowledge or modification of the application.
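The selection scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Implementation`, `Selector`, `requires`, `loader`, and `explore_runs` are hypothetical, requirements are modeled as plain sets of capability strings, on-demand linking is approximated by a loader callable that is invoked only when an implementation is first chosen, and the history-based selector is assumed to try each usable implementation once and then pick the lowest mean runtime.

```python
import time
from collections import defaultdict
from statistics import mean


class Implementation:
    """One hardware-specific variant of a task (hypothetical model)."""

    def __init__(self, name, requires, loader):
        self.name = name
        self.requires = frozenset(requires)  # capabilities this variant needs
        self._loader = loader                # zero-argument callable returning the function
        self._func = None

    def resolve(self):
        # Load the variant only when it is first needed, standing in for
        # on-demand linking of accelerator-specific code.
        if self._func is None:
            self._func = self._loader()
        return self._func


class Selector:
    """Filters usable variants, then picks by observed runtime history."""

    def __init__(self, implementations, available, explore_runs=1,
                 timer=time.perf_counter):
        available = frozenset(available)
        # A variant is usable if all of its requirements are met by the system.
        self.usable = [i for i in implementations if i.requires <= available]
        if not self.usable:
            raise RuntimeError("no usable implementation on this system")
        self.history = defaultdict(list)     # variant name -> measured runtimes
        self.explore_runs = explore_runs
        self.timer = timer

    def choose(self):
        # Exploration: run every usable variant at least explore_runs times.
        for impl in self.usable:
            if len(self.history[impl.name]) < self.explore_runs:
                return impl
        # Exploitation: pick the variant with the lowest mean runtime so far.
        return min(self.usable, key=lambda i: mean(self.history[i.name]))

    def run(self, *args):
        impl = self.choose()
        func = impl.resolve()                # loading is kept out of the timing
        start = self.timer()
        result = func(*args)
        self.history[impl.name].append(self.timer() - start)
        return impl.name, result
```

In this sketch, a variant whose requirements are not met is never loaded, so a missing accelerator library cannot break the application. A real selector would likely also account for input size and re-explore periodically, which the sketch omits.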