An idiom-finding tool for increasing productivity of accelerators

Authors:
Laura Carrington;Mustafa M. Tikir;Catherine Olschanowsky;Michael Laurenzano;Joshua Peraza;Allan Snavely;Stephen Poole
Affiliations:
UCSD/SDSC, La Jolla, CA, USA;Google Inc., Mountain View, CA, USA;UCSD/SDSC, La Jolla, CA, USA;UCSD/SDSC, La Jolla, CA, USA;UCSD/SDSC, La Jolla, CA, USA;UCSD/SDSC, La Jolla, CA, USA;ORNL, Oak Ridge, TN, USA
Venue:
Proceedings of the international conference on Supercomputing
Year:
2011

Abstract

Suppose one is considering the purchase of a computer equipped with accelerators, or has access to such a computer and is considering porting code to take advantage of the accelerators. Is there reason to suppose the purchase cost or programmer effort will be worth it? It would be useful to be able to estimate the expected improvements before spending money or time. We present an analytical framework and tool-set for providing such estimates. The tools first look for user-defined idioms: patterns of computation and data access identified in advance as ones that may benefit from accelerator hardware. A performance model is then applied to estimate how much faster these idioms would run if they were ported to the accelerators. The tools then recommend whether each idiom is worth the porting effort and estimate what the overall application speedup would be if the port were done. As a proof of concept, we focus our investigation on Gather/Scatter (G/S) operations and on the means of accelerating them available on the Convey HC-1, which has a special-purpose "personality" for accelerating G/S. We test the methodology on two large-scale HPC applications. The idiom recognizer tool saves weeks of programmer effort compared to having the programmer examine the code visually looking for idioms; the performance models save yet more time by rank-ordering the best candidates for porting; and the performance models are accurate, predicting the G/S runtime speedup resulting from porting to within 10% of the speedup actually achieved. The G/S hardware on the Convey sped up these operations 20x, and the overall impact was to improve total application runtime by as much as 21%.
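
To make the kind of estimate described above concrete, the following is a minimal sketch and not the authors' tool or performance model. It shows what a gather and a scatter idiom look like as code, and it uses a plain Amdahl's-law calculation (an assumption swapped in here for illustration) to turn a per-idiom speedup into an overall application speedup. The function names, the `idiom_fraction` parameter, and the 5% porting threshold are all hypothetical.

```python
def gather(src, idx):
    """Gather idiom: read src through an index array, as in a[i] = b[idx[i]]."""
    return [src[i] for i in idx]


def scatter(dst, idx, values):
    """Scatter idiom: write through an index array, as in a[idx[i]] = b[i]."""
    for i, v in zip(idx, values):
        dst[i] = v
    return dst


def overall_speedup(idiom_fraction, idiom_speedup):
    """Amdahl's-law estimate of whole-application speedup.

    idiom_fraction: share of original runtime spent in the idiom (0..1).
    idiom_speedup:  how many times faster the idiom runs once ported.
    Both inputs are assumptions the user must supply; this is not the
    paper's performance model.
    """
    if not 0.0 <= idiom_fraction <= 1.0:
        raise ValueError("idiom_fraction must be between 0 and 1")
    if idiom_speedup <= 0:
        raise ValueError("idiom_speedup must be positive")
    # Runtime after porting, normalized so the original runtime is 1.0.
    new_runtime = (1.0 - idiom_fraction) + idiom_fraction / idiom_speedup
    return 1.0 / new_runtime


def worth_porting(idiom_fraction, idiom_speedup, threshold=1.05):
    """Hypothetical decision rule: port only if the whole application
    is estimated to speed up by at least `threshold` (default 5%)."""
    return overall_speedup(idiom_fraction, idiom_speedup) >= threshold
```

Under this simplified model, even a 20x idiom speedup yields only a modest application-level gain when the idiom accounts for a small share of runtime, which is why ranking candidates by their estimated overall impact matters before committing porting effort.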