Theoretical modeling of superscalar processor performance
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Analytic evaluation of shared-memory systems with ILP processors
Proceedings of the 25th annual international symposium on Computer architecture
Exploring Instruction-Fetch Bandwidth Requirement in Wide-Issue Superscalar Processors
PACT '99 Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques
Data-Flow Prescheduling for Large Instruction Windows in Out-of-Order Processors
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
An Analytical Solution for a Markov Chain Modeling Multithreaded
A First-Order Superscalar Processor Model
Proceedings of the 31st annual international symposium on Computer architecture
Merge: a programming model for heterogeneous multi-core systems
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Program optimization space pruning for a multithreaded gpu
Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Larrabee: a many-core x86 architecture for visual computing
ACM SIGGRAPH 2008 papers
Scalable Parallel Programming with CUDA
Queue - GPU Computing
Face detection using spectral histograms and SVMs
IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics
An adaptive performance modeling tool for GPU architectures
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Model-driven autotuning of sparse matrix-vector multiply on GPUs
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Modeling GPU-CPU workloads and systems
Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
A GPGPU compiler for memory optimization and parallelism management
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Proceedings of the 24th ACM International Conference on Supercomputing
An integrated GPU power and performance model
Proceedings of the 37th annual international symposium on Computer architecture
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU
Proceedings of the 37th annual international symposium on Computer architecture
PacketShader: a GPU-accelerated software router
Proceedings of the ACM SIGCOMM 2010 conference
An analytical network performance model for SIMD processor CSX600 interconnects
Journal of Systems Architecture: the EUROMICRO Journal
Many-Thread Aware Prefetching Mechanisms for GPGPU Applications
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Fast sparse matrix-vector multiplication on GPUs: implications for graph mining
Proceedings of the VLDB Endowment
Sponge: portable stream programming on graphics engines
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
A framework for dynamically instrumenting GPU compute applications within GPU Ocelot
Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
An idiom-finding tool for increasing productivity of accelerators
Proceedings of the international conference on Supercomputing
Balance principles for algorithm-architecture co-design
HotPar'11 Proceedings of the 3rd USENIX conference on Hot Topics in Parallelism
Bounding the effect of partition camping in GPU kernels
Proceedings of the 8th ACM International Conference on Computing Frontiers
Model-driven tile size selection for DOACROSS loops on GPUs
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
GROPHECY: GPU performance projection from CPU code skeletons
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
ICA3PP'11 Proceedings of the 11th international conference on Algorithms and architectures for parallel processing - Volume Part I
A performance analysis framework for identifying potential benefits in GPGPU applications
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Efficient performance evaluation of memory hierarchy for highly multithreaded graphics processors
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Optimization strategies in different CUDA architectures using llCoMP
Microprocessors & Microsystems
Auto-tuning interactive ray tracing using an analytical GPU architecture model
Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
Break down GPU execution time with an analytical method
Proceedings of the 2012 Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools
A case for coordinated resource management in heterogeneous multicore platforms
ISCA'10 Proceedings of the 2010 international conference on Computer Architecture
A unified optimizing compiler framework for different GPGPU architectures
ACM Transactions on Architecture and Code Optimization (TACO)
BSArc: blacksmith streaming architecture for HPC accelerators
Proceedings of the 9th conference on Computing Frontiers
Parameterized micro-benchmarking: an auto-tuning approach for complex applications
Proceedings of the 9th conference on Computing Frontiers
Parallelizing SOR for GPGPUs using alternate loop tiling
Parallel Computing
Heterogeneous systems for energy efficient scientific computing
ARC'12 Proceedings of the 8th international conference on Reconfigurable Computing: architectures, tools and applications
Adaptive input-aware compilation for graphics engines
Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Characterizing and improving the use of demand-fetched caches in GPUs
Proceedings of the 26th ACM international conference on Supercomputing
Performance models for asynchronous data transfers on consumer Graphics Processing Units
Journal of Parallel and Distributed Computing
Feedback-Based global instruction scheduling for GPGPU applications
ICCSA'12 Proceedings of the 12th international conference on Computational Science and Its Applications - Volume Part I
Power and performance analysis of GPU-accelerated systems
HotPower'12 Proceedings of the 2012 USENIX conference on Power-Aware Computing and Systems
Dataflow-driven GPU performance projection for multi-kernel transformations
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
GPURoofline: a model for guiding performance optimizations on GPUs
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
An insightful program performance tuning chain for GPU computing
ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I
Regional cache organization for NoC based many-core processors
Journal of Computer and System Sciences
CAP: co-scheduling based on asymptotic profiling in CPU+GPU hybrid systems
Proceedings of the 2013 International Workshop on Programming Models and Applications for Multicores and Manycores
Power and Performance Management of GPUs Based Cluster
International Journal of Cloud Applications and Computing
Optimizing tensor contraction expressions for hybrid CPU-GPU execution
Cluster Computing
Influence of memory access patterns to small-scale FFT performance
The Journal of Supercomputing
Fast on-line statistical learning on a GPGPU
AusPDC '11 Proceedings of the Ninth Australasian Symposium on Parallel and Distributed Computing - Volume 118
Scaling large-data computations on multi-GPU accelerators
Proceedings of the 27th international ACM conference on International conference on supercomputing
Performance characterization of data-intensive kernels on AMD Fusion architectures
Computer Science - Research and Development
GPU-CC: a reconfigurable GPU architecture with communicating cores
Proceedings of the 16th International Workshop on Software and Compilers for Embedded Systems
ACM Transactions on Embedded Computing Systems (TECS) - Special issue on application-specific processors
Exploring hybrid memory for GPU energy efficiency through software-hardware co-design
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Neither more nor less: optimizing thread-level parallelism for GPGPUs
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Starchart: hardware and software optimization using recursive partitioning regression trees
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
A measurement study of GPU DVFS on energy conservation
Proceedings of the Workshop on Power-Aware Computing and Systems
CUDA-NP: realizing nested thread-level parallelism in GPGPU applications
Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
GPU code generation for ODE-based applications with phased shared-data access patterns
ACM Transactions on Architecture and Code Optimization (TACO)
An efficient compiler framework for cache bypassing on GPUs
Proceedings of the International Conference on Computer-Aided Design
A memory access model for highly-threaded many-core architectures
Future Generation Computer Systems
Power Modeling for Heterogeneous Processors
Proceedings of Workshop on General Purpose Processing Using GPUs
CPU+GPU scheduling with asymptotic profiling
Parallel Computing
GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Programming thousands of massively parallel threads is a major challenge for software engineers, and understanding the performance bottlenecks of those parallel programs on GPU architectures well enough to improve application performance is even more difficult. Current approaches rely on programmers to tune their applications by exploring the design space exhaustively, without fully understanding the performance characteristics of their applications. To provide insight into the performance bottlenecks of parallel applications on GPU architectures, we propose a simple analytical model that estimates the execution time of massively parallel programs. The key component of our model is estimating the number of parallel memory requests (which we call memory warp parallelism) from the number of running threads and the memory bandwidth. Based on the degree of memory warp parallelism, the model estimates the cost of memory requests, and thereby the overall execution time of a program. Comparing the model's output with actual execution times on several GPUs shows that the geometric mean of the absolute error of our model is 5.4% on micro-benchmarks and 13.3% on GPU computing applications. All applications are written in the CUDA programming language.
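The idea of bounding execution time by memory warp parallelism can be illustrated with a short sketch. The function below is a simplified, illustrative cost estimate, not the authors' exact model: all parameter names (`mem_latency`, `departure_delay`, `max_mwp_bw`) and both cost formulas are assumptions chosen to show the shape of the technique — comparing how many warps can overlap memory requests (MWP) against how many warps of computation are available to hide one warp's memory latency (CWP).

```python
def estimate_exec_cycles(n_warps, comp_cycles, mem_cycles,
                         mem_latency, departure_delay, max_mwp_bw):
    """Illustrative per-core cycle estimate for n_warps warps.

    comp_cycles / mem_cycles: computation and memory cycles per warp.
    mem_latency: round-trip latency of one memory request (cycles).
    departure_delay: minimum spacing between consecutive requests.
    max_mwp_bw: bandwidth-imposed cap on in-flight memory warps.
    All formulas here are a hedged simplification for illustration.
    """
    # Memory warp parallelism (MWP): warps that can have memory
    # requests in flight at once, limited by latency over request
    # spacing, by bandwidth, and by the number of warps available.
    mwp = min(mem_latency / departure_delay, max_mwp_bw, n_warps)
    # Computation warp parallelism (CWP): warps whose computation
    # can run while one warp waits on memory.
    cwp = min((mem_cycles + comp_cycles) / comp_cycles, n_warps)
    if mwp >= cwp:
        # Enough memory parallelism: latency is mostly hidden behind
        # computation, with one memory period left exposed.
        return comp_cycles * n_warps + mem_cycles
    # Memory-bound: requests proceed in rounds of mwp warps each,
    # so memory periods serialize across n_warps / mwp rounds.
    return mem_cycles * (n_warps / mwp) + comp_cycles
```

Under this sketch, a kernel whose MWP is capped by bandwidth well below its CWP is predicted to scale with memory rounds rather than with computation, which is exactly the kind of bottleneck diagnosis the abstract describes.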