Theoretical modeling of superscalar processor performance
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Analytic evaluation of shared-memory systems with ILP processors
Proceedings of the 25th annual international symposium on Computer architecture
Exploring Instruction-Fetch Bandwidth Requirement in Wide-Issue Superscalar Processors
PACT '99 Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques
Data-Flow Prescheduling for Large Instruction Windows in Out-of-Order Processors
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
An Analytical Solution for a Markov Chain Modeling Multithreaded
A First-Order Superscalar Processor Model
Proceedings of the 31st annual international symposium on Computer architecture
Merge: a programming model for heterogeneous multi-core systems
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Program optimization space pruning for a multithreaded gpu
Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Larrabee: a many-core x86 architecture for visual computing
ACM SIGGRAPH 2008 papers
Scalable Parallel Programming with CUDA
Queue - GPU Computing
Face detection using spectral histograms and SVMs
IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics
An adaptive performance modeling tool for GPU architectures
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Model-driven autotuning of sparse matrix-vector multiply on GPUs
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Modeling GPU-CPU workloads and systems
Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
A GPGPU compiler for memory optimization and parallelism management
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Proceedings of the 24th ACM International Conference on Supercomputing
An integrated GPU power and performance model
Proceedings of the 37th annual international symposium on Computer architecture
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU
Proceedings of the 37th annual international symposium on Computer architecture
PacketShader: a GPU-accelerated software router
Proceedings of the ACM SIGCOMM 2010 conference
An analytical network performance model for SIMD processor CSX600 interconnects
Journal of Systems Architecture: the EUROMICRO Journal
Many-Thread Aware Prefetching Mechanisms for GPGPU Applications
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Fast sparse matrix-vector multiplication on GPUs: implications for graph mining
Proceedings of the VLDB Endowment
Sponge: portable stream programming on graphics engines
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
A framework for dynamically instrumenting GPU compute applications within GPU Ocelot
Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
An idiom-finding tool for increasing productivity of accelerators
Proceedings of the international conference on Supercomputing
Balance principles for algorithm-architecture co-design
HotPar'11 Proceedings of the 3rd USENIX conference on Hot Topics in Parallelism
Bounding the effect of partition camping in GPU kernels
Proceedings of the 8th ACM International Conference on Computing Frontiers
Model-driven tile size selection for DOACROSS loops on GPUs
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
GROPHECY: GPU performance projection from CPU code skeletons
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
ICA3PP'11 Proceedings of the 11th international conference on Algorithms and architectures for parallel processing - Volume Part I
A performance analysis framework for identifying potential benefits in GPGPU applications
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Efficient performance evaluation of memory hierarchy for highly multithreaded graphics processors
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Optimization strategies in different CUDA architectures using llCoMP
Microprocessors & Microsystems
Auto-tuning interactive ray tracing using an analytical GPU architecture model
Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
Break down GPU execution time with an analytical method
Proceedings of the 2012 Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools
A case for coordinated resource management in heterogeneous multicore platforms
ISCA'10 Proceedings of the 2010 international conference on Computer Architecture
A unified optimizing compiler framework for different GPGPU architectures
ACM Transactions on Architecture and Code Optimization (TACO)
BSArc: blacksmith streaming architecture for HPC accelerators
Proceedings of the 9th conference on Computing Frontiers
Parameterized micro-benchmarking: an auto-tuning approach for complex applications
Proceedings of the 9th conference on Computing Frontiers
Parallelizing SOR for GPGPUs using alternate loop tiling
Parallel Computing
Heterogeneous systems for energy efficient scientific computing
ARC'12 Proceedings of the 8th international conference on Reconfigurable Computing: architectures, tools and applications
Adaptive input-aware compilation for graphics engines
Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Characterizing and improving the use of demand-fetched caches in GPUs
Proceedings of the 26th ACM international conference on Supercomputing
Performance models for asynchronous data transfers on consumer Graphics Processing Units
Journal of Parallel and Distributed Computing
Feedback-Based global instruction scheduling for GPGPU applications
ICCSA'12 Proceedings of the 12th international conference on Computational Science and Its Applications - Volume Part I
Power and performance analysis of GPU-accelerated systems
HotPower'12 Proceedings of the 2012 USENIX conference on Power-Aware Computing and Systems
Dataflow-driven GPU performance projection for multi-kernel transformations
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
GPURoofline: a model for guiding performance optimizations on GPUs
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
An insightful program performance tuning chain for GPU computing
ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I
Regional cache organization for NoC based many-core processors
Journal of Computer and System Sciences
CAP: co-scheduling based on asymptotic profiling in CPU+GPU hybrid systems
Proceedings of the 2013 International Workshop on Programming Models and Applications for Multicores and Manycores
Power and Performance Management of GPUs Based Cluster
International Journal of Cloud Applications and Computing
Optimizing tensor contraction expressions for hybrid CPU-GPU execution
Cluster Computing
Influence of memory access patterns to small-scale FFT performance
The Journal of Supercomputing
Fast on-line statistical learning on a GPGPU
AusPDC '11 Proceedings of the Ninth Australasian Symposium on Parallel and Distributed Computing - Volume 118
Scaling large-data computations on multi-GPU accelerators
Proceedings of the 27th international ACM conference on International conference on supercomputing
Performance characterization of data-intensive kernels on AMD Fusion architectures
Computer Science - Research and Development
GPU-CC: a reconfigurable GPU architecture with communicating cores
Proceedings of the 16th International Workshop on Software and Compilers for Embedded Systems
ACM Transactions on Embedded Computing Systems (TECS) - Special issue on application-specific processors
Exploring hybrid memory for GPU energy efficiency through software-hardware co-design
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Neither more nor less: optimizing thread-level parallelism for GPGPUs
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Starchart: hardware and software optimization using recursive partitioning regression trees
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
A measurement study of GPU DVFS on energy conservation
Proceedings of the Workshop on Power-Aware Computing and Systems
CUDA-NP: realizing nested thread-level parallelism in GPGPU applications
Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
GPU code generation for ODE-based applications with phased shared-data access patterns
ACM Transactions on Architecture and Code Optimization (TACO)
An efficient compiler framework for cache bypassing on GPUs
Proceedings of the International Conference on Computer-Aided Design
A memory access model for highly-threaded many-core architectures
Future Generation Computer Systems
Power Modeling for Heterogeneous Processors
Proceedings of Workshop on General Purpose Processing Using GPUs
CPU+GPU scheduling with asymptotic profiling
Parallel Computing
GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Programming thousands of massively parallel threads is a major challenge for software engineers, and understanding the performance bottlenecks of those parallel programs on GPU architectures well enough to improve application performance is even more difficult. Current approaches rely on programmers to tune their applications by exploring the design space exhaustively, without fully understanding the performance characteristics of their applications. To provide insight into the performance bottlenecks of parallel applications on GPU architectures, we propose a simple analytical model that estimates the execution time of massively parallel programs. The key component of our model is estimating the number of parallel memory requests (which we call memory warp parallelism) from the number of running threads and the memory bandwidth. Based on the degree of memory warp parallelism, the model estimates the cost of memory requests, and thereby the overall execution time of a program. Comparing the model's output with actual execution times on several GPUs shows that the geometric mean of the absolute error of our model is 5.4% on micro-benchmarks and 13.3% on GPU computing applications. All applications are written in the CUDA programming language.
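The idea of bounding execution time by memory warp parallelism can be illustrated with a short sketch. The function below is a simplified, illustrative cost estimate, not the authors' exact model: all parameter names (`mem_latency`, `departure_delay`, `max_mwp_bw`) and both cost formulas are assumptions chosen to show the shape of the technique — comparing how many warps can overlap memory requests (MWP) against how many warps of computation are available to hide one warp's memory latency (CWP).

```python
def estimate_exec_cycles(n_warps, comp_cycles, mem_cycles,
                         mem_latency, departure_delay, max_mwp_bw):
    """Illustrative per-core cycle estimate for n_warps warps.

    comp_cycles / mem_cycles: computation and memory cycles per warp.
    mem_latency: round-trip latency of one memory request (cycles).
    departure_delay: minimum spacing between consecutive requests.
    max_mwp_bw: bandwidth-imposed cap on in-flight memory warps.
    All formulas here are a hedged simplification for illustration.
    """
    # Memory warp parallelism (MWP): warps that can have memory
    # requests in flight at once, limited by latency over request
    # spacing, by bandwidth, and by the number of warps available.
    mwp = min(mem_latency / departure_delay, max_mwp_bw, n_warps)
    # Computation warp parallelism (CWP): warps whose computation
    # can run while one warp waits on memory.
    cwp = min((mem_cycles + comp_cycles) / comp_cycles, n_warps)
    if mwp >= cwp:
        # Enough memory parallelism: latency is mostly hidden behind
        # computation, with one memory period left exposed.
        return comp_cycles * n_warps + mem_cycles
    # Memory-bound: requests proceed in rounds of mwp warps each,
    # so memory periods serialize across n_warps / mwp rounds.
    return mem_cycles * (n_warps / mwp) + comp_cycles
```

Under this sketch, a kernel whose MWP is capped by bandwidth well below its CWP is predicted to scale with memory rounds rather than with computation, which is exactly the kind of bottleneck diagnosis the abstract describes.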