Modeling GPU-CPU workloads and systems

Authors:
Andrew Kerr;Gregory Diamos;Sudhakar Yalamanchili
Affiliations:
Georgia Institute of Technology, Atlanta, GA;Georgia Institute of Technology, Atlanta, GA;Georgia Institute of Technology, Atlanta, GA
Venue:
Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
Year:
2010

Abstract

Heterogeneous systems, systems with multiple processors tailored for specialized tasks, are challenging programming environments. While it may be possible for domain experts to optimize a high-performance application for a very specific and well-documented system, it may not perform as well or even function on a different system. Developers who have less experience with either the application domain or the system architecture may devote significant effort to writing a program that merely functions correctly. We believe that a comprehensive analysis and modeling framework is necessary to ease application development and automate program optimization on heterogeneous platforms. This paper reports on an empirical evaluation of 25 CUDA applications on four GPUs and three CPUs, leveraging the Ocelot dynamic compiler infrastructure, which can execute and instrument the same CUDA applications on either target. Using a combination of instrumentation and statistical analysis, we record 37 different metrics for each application and use them to derive relationships between program behavior and performance on heterogeneous processors. These relationships are then fed into a modeling framework that attempts to predict the performance of similar classes of applications on different processors. Most significantly, this study identifies several non-intuitive relationships between program characteristics and demonstrates that it is possible to accurately model CUDA kernel performance using only metrics that are available before a kernel is executed.
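
As a rough illustration of the metrics-to-performance pipeline the abstract describes, the sketch below fits a principal component regression to a table of per-application metrics. It is not the authors' method: the abstract does not name the statistical technique, so principal component regression is swapped in here as one plausible stand-in. The function names (`fit_pca_regression`, `predict`), the choice of five components, and all data are hypothetical; only the counts of 25 applications and 37 metrics come from the abstract, and the values are synthetic.

```python
import numpy as np


def fit_pca_regression(metrics, runtimes, n_components):
    """Fit a principal component regression (an assumed stand-in technique).

    metrics:  (n_apps, n_metrics) array of per-application measurements
    runtimes: (n_apps,) array of observed performance values
    """
    mean = metrics.mean(axis=0)
    std = metrics.std(axis=0)
    std[std == 0.0] = 1.0  # guard against metrics that never vary
    z = (metrics - mean) / std

    # Rows of vt are the principal directions of the standardized metrics.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    components = vt[:n_components]

    # Regress runtime on the leading component scores plus an intercept.
    scores = z @ components.T
    design = np.column_stack([np.ones(len(scores)), scores])
    coef, *_ = np.linalg.lstsq(design, runtimes, rcond=None)
    return {"mean": mean, "std": std, "components": components, "coef": coef}


def predict(model, metrics):
    """Predict performance for applications described by the same metrics."""
    z = (metrics - model["mean"]) / model["std"]
    scores = z @ model["components"].T
    design = np.column_stack([np.ones(len(scores)), scores])
    return design @ model["coef"]


# Synthetic data only: the shapes follow the abstract, the values do not.
rng = np.random.default_rng(0)
n_apps, n_metrics = 25, 37
metrics = rng.normal(size=(n_apps, n_metrics))
true_weights = rng.normal(size=n_metrics)
runtimes = metrics @ true_weights + rng.normal(scale=0.1, size=n_apps)

model = fit_pca_regression(metrics, runtimes, n_components=5)
predicted = predict(model, metrics)
```

Reducing the metrics to a few components before regressing is one way to cope with having more metrics (37) than applications (25), where an ordinary least-squares fit on the raw metrics would be underdetermined. In practice the model would be evaluated on applications held out from fitting, which this sketch does not do.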