Towards autotuning by alternating communication methods
Proceedings of the second international workshop on Performance modeling, benchmarking and simulation of high performance computing systems
Interconnects in emerging high performance computing systems feature hardware support for one-sided, asynchronous communication and global address space programming models, improving parallel efficiency and productivity by allowing communication/computation overlap and out-of-order delivery. In practice, though, complex interactions between the software stack and the communication hardware make it challenging to obtain optimal performance for a full application expressed with a one-sided programming paradigm. Here, we present a proof-of-concept study for an autotuning framework that instantiates hybrid kernels based on refactored codes using the available communication libraries or languages on a Cray XE6 and an SGI Altix UV 1000. We validate our approach by improving the performance of bandwidth-bound and latency-bound kernels of interest in quantum physics and astrophysics by up to 35% and 80%, respectively.
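
To make the overlap idea concrete, below is a minimal sketch of one communication variant such a framework might instantiate, using MPI-3 one-sided operations (RMA) as a stand-in for the one-sided layers discussed in the abstract. This is illustrative only, not the paper's code: the buffer names and the NX problem size are hypothetical, and the actual kernels, libraries, and tuning decisions in the study differ.

/* Sketch: initiate a one-sided put, compute locally while the transfer
 * is (potentially) in flight, and complete it only when needed.
 * Assumes an MPI-3 implementation; build with mpicc (or cc on Cray). */
#include <mpi.h>
#include <stdlib.h>

#define NX 1024  /* hypothetical local problem size */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *halo;        /* remotely accessible receive buffer */
    double  local[NX];   /* purely local work array */
    MPI_Win win;

    /* Expose the halo buffer for one-sided access by other ranks. */
    MPI_Win_allocate(NX * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &halo, &win);

    for (int i = 0; i < NX; ++i) local[i] = rank + i;

    int right = (rank + 1) % size;

    MPI_Win_lock_all(0, win);

    /* 1. Initiate the one-sided transfer; it may proceed asynchronously
     *    in hardware while this rank keeps computing. */
    MPI_Put(local, NX, MPI_DOUBLE, right, 0, NX, MPI_DOUBLE, win);

    /* 2. Overlap: independent local computation that does not depend
     *    on the in-flight transfer. */
    double acc = 0.0;
    for (int i = 0; i < NX; ++i) acc += local[i] * local[i];

    /* 3. Complete the transfer only when its result is actually needed. */
    MPI_Win_flush(right, win);
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return (int)(acc < 0.0);  /* keep the compiler from dropping the loop */
}

An autotuner in the spirit of the abstract would time several such variants of the same kernel (for example, this RMA version against a two-sided or library-specific one) on the target machine and keep the fastest.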