Shared memory programming for large scale machines

  • Authors:
  • Christopher Barton; Călin Cașcaval; George Almási; Yili Zheng; Montse Farreras; Siddhartha Chatterjee; José Nelson Amaral

  • Affiliations:
  • University of Alberta, Edmonton, Canada; IBM T. J. Watson Research Center, Yorktown Heights, NY; IBM T. J. Watson Research Center, Yorktown Heights, NY; Purdue University, West Lafayette, IN; Universitat Politècnica de Catalunya, Barcelona, Spain; IBM T. J. Watson Research Center, Yorktown Heights, NY; University of Alberta, Edmonton, Canada

  • Venue:
  • Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
  • Year:
  • 2006


Abstract

This paper describes the design and implementation of a scalable run-time system and an optimizing compiler for Unified Parallel C (UPC). An experimental evaluation on BlueGene/L®, a distributed-memory machine, demonstrates that the combination of the compiler with the runtime system produces programs with performance comparable to that of efficient MPI programs, and with good performance scalability up to hundreds of thousands of processors.

Our runtime system design solves the problem of maintaining shared object consistency efficiently on a distributed-memory machine. Our compiler infrastructure simplifies the code generated for parallel loops in UPC through the elimination of affinity tests, removes several levels of indirection for accesses to segments of shared arrays that the compiler can prove to be local, and implements remote update operations through a lower-cost asynchronous message.

The performance evaluation uses three well-known benchmarks --- HPC RandomAccess, HPC STREAM, and NAS CG --- to obtain scaling and absolute performance numbers for these benchmarks on up to 131,072 processors, the full BlueGene/L machine. These results were used to win the HPC Challenge Competition at SC05 in Seattle, WA, demonstrating that PGAS languages support both productivity and performance.
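
To make the affinity-test elimination mentioned above concrete, the minimal UPC sketch below (illustrative only, not taken from the paper; the array name `a` and size `N` are assumptions) shows a `upc_forall` loop, the naive per-iteration affinity test a straightforward translation would generate, and the strided form an optimizing compiler can emit instead.

```c
/* Minimal UPC sketch (illustrative, not from the paper).
 * With the default cyclic layout, element a[i] has affinity to
 * thread i % THREADS, so the per-iteration affinity test can be
 * replaced by a strided loop over each thread's own elements. */
#include <upc_relaxed.h>

#define N 1024
shared int a[N * THREADS];   /* N elements per thread, cyclic layout */

int main(void) {
    /* Source form: the fourth clause names the affinity expression. */
    upc_forall (int i = 0; i < N * THREADS; i++; &a[i])
        a[i] = i;

    /* Naive translation: every thread scans all iterations and runs
     * the body only when the affinity test passes. */
    for (int i = 0; i < N * THREADS; i++)
        if (upc_threadof(&a[i]) == MYTHREAD)
            a[i] = i;

    /* After affinity-test elimination: each thread strides directly
     * over the elements it owns; no per-iteration test remains. */
    for (int i = MYTHREAD; i < N * THREADS; i += THREADS)
        a[i] = i;

    upc_barrier;
    return 0;
}
```

The compiler described in the paper performs this kind of transformation automatically (alongside privatizing accesses it can prove local); the loop above only sketches the shape of the simplified code, not the actual generated output.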