Experiences with UPC on TILE-64 processor

  • Authors:
  • Olivier Serres;Ahmad Anbar;Saumil Merchant;Tarek El-Ghazawi

  • Affiliations:
  • NSF Center for High-Performance Reconfigurable Computing (CHREC), Dept. of Electrical and Computer Engineering, The George Washington University, 801 22nd St NW, 20052, USA (all four authors)

  • Venue:
  • AERO '11 Proceedings of the 2011 IEEE Aerospace Conference
  • Year:
  • 2011

Abstract

The partitioned global address space (PGAS) programming model presents programmers with a globally shared address space with locality awareness and one-sided communication constructs. The shared address space and the one-sided communication constructs make PGAS-based languages easier to use, and the locality awareness enables programmers and runtime systems to achieve higher performance. The PGAS programming model may thus help address the escalating software complexity resulting from the proliferation of many-core processor architectures in aerospace and in computing systems in general. This paper presents our experiences with Unified Parallel C (UPC), a PGAS language, on the Tile64™ processor, a 64-core processor from Tilera Corporation. We ported the Berkeley UPC compiler and runtime system to the Tilera architecture and evaluated two separate runtime implementation conduits of the underlying GASNet communication library: a Pthreads-based conduit and an MPI-based conduit. Each conduit uses different on-chip, inter-core communication networks, which provide different latencies and bandwidths for inter-process communication. The paper presents the implementation details and empirical analyses of both approaches by comparing and evaluating results from the NAS Parallel Benchmark suite. The analyses reveal various optimization opportunities based on specific many-core architectural features, which are also discussed in the paper.
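
To make the locality-awareness idea concrete, here is a minimal sketch in plain C (not UPC, and not code from the paper) that emulates how a UPC shared array with the default cyclic layout assigns each element an affinity to one thread. The helper names `owner_of`, `local_sum`, and `remote_reads` are hypothetical and chosen only for illustration; the thread count is passed as an ordinary argument rather than taken from UPC's `THREADS`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the thread that has affinity to element i under a
 * cyclic (block size 1) distribution, UPC's default for shared arrays. */
static int owner_of(size_t i, int threads) {
    assert(threads > 0);
    return (int)(i % (size_t)threads);
}

/* Emulates the work thread `me` would do in
 *   upc_forall (i = 0; i < n; i++; &a[i]) sum += a[i];
 * that is, it visits only the elements the thread has affinity to. */
static long local_sum(const int *a, size_t n, int me, int threads) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (owner_of(i, threads) == me) {
            sum += a[i];
        }
    }
    return sum;
}

/* Counts the elements that thread `me` would have to fetch from other
 * threads (one-sided remote reads) if it scanned the whole array itself. */
static size_t remote_reads(size_t n, int me, int threads) {
    size_t remote = 0;
    for (size_t i = 0; i < n; i++) {
        if (owner_of(i, threads) != me) {
            remote++;
        }
    }
    return remote;
}
```

Summing `local_sum` over all thread indices gives the same total as a serial loop, while each emulated thread touches only data it owns. `remote_reads` shows the cost the affinity-driven loop avoids: with 8 elements and 4 threads, a single thread scanning the whole array would make 6 remote accesses, each of which would travel over whichever on-chip network the chosen conduit uses.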