Rodinia: A benchmark suite for heterogeneous computing

Authors:
Shuai Che;Michael Boyer;Jiayuan Meng;David Tarjan;Jeremy W. Sheaffer;Sang-Ha Lee;Kevin Skadron
Affiliations:
Department of Computer Science, University of Virginia, USA (all authors)
Venue:
IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
Year:
2009

Citing 0
Cited 126

Modeling GPU-CPU workloads and systems

Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
The Scalable Heterogeneous Computing (SHOC) benchmark suite

Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
An OpenCL framework for heterogeneous multicores with local memory

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Throughput-Effective On-Chip Networks for Manycore Accelerators

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Many-Thread Aware Prefetching Mechanisms for GPGPU Applications

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Accelerating CUDA graph algorithms at maximum warp

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
On-the-fly elimination of dynamic irregularities for GPU computing

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Power and Performance Characterization of Computational Kernels on the GPU

GREENCOM-CPSCOM '10 Proceedings of the 2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing
Quantifying NUMA and contention effects in multi-GPU systems

Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
CnC-CUDA: declarative programming for GPUs

LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Considering GPGPU for HPC centers: is it worth the effort?

Facing the multicore-challenge
MDR: performance model driven runtime for heterogeneous parallel platforms

Proceedings of the international conference on Supercomputing
SRAM-DRAM hybrid memory with applications to efficient register files in fine-grained multi-threading

Proceedings of the 38th annual international symposium on Computer architecture
Automatic OpenCL device characterization: guiding optimized kernel design

Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
Evaluation of an accelerator architecture for speckle reducing anisotropic diffusion

CASES '11 Proceedings of the 14th international conference on Compilers, architectures and synthesis for embedded systems
Dymaxion: optimizing memory access patterns for heterogeneous systems

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Exploring the limits of GPGPU scheduling in control flow bound applications

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Seamlessly portable applications: Managing the diversity of modern heterogeneous systems

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Identifying hotspots in a program for data parallel architecture: an early experience

Proceedings of the 5th India Software Engineering Conference
Scalable GPU graph traversal

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Poster: determining code segments that can benefit from execution on GPUs

Proceedings of the 2011 companion on High Performance Computing Networking, Storage and Analysis Companion
Improving GPU performance via large warps and two-level warp scheduling

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
A compile-time managed multi-level register file hierarchy

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Reducing off-chip memory traffic by selective cache management scheme in GPGPUs

Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors

ACM Transactions on Computer Systems (TOCS)
The "Chimera": an off-the-shelf CPU/GPGPU/FPGA hybrid computing platform

International Journal of Reconfigurable Computing - Special issue on High-Performance Reconfigurable Computing
A methodology for energy-quality tradeoff using imprecise hardware

Proceedings of the 49th Annual Design Automation Conference
Characterization and transformation of unstructured control flow in bulk synchronous GPU applications

International Journal of High Performance Computing Applications
Thermal management of a many-core processor under fine-grained parallelism

Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing
Boosting single thread performance in mobile processors via reconfigurable acceleration

ARC'12 Proceedings of the 8th international conference on Reconfigurable Computing: architectures, tools and applications
Dynamically managed data for CPU-GPU architectures

Proceedings of the Tenth International Symposium on Code Generation and Optimization
Dynamic binary rewriting and migration for shared-ISA asymmetric, multicore processors: summary

Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
Characterizing and improving the use of demand-fetched caches in GPUs

Proceedings of the 26th ACM international conference on Supercomputing
One stone two birds: synchronization relaxation and redundancy removal in GPU-CPU translation

Proceedings of the 26th ACM international conference on Supercomputing
Scheduling Concurrent Applications on a Cluster of CPU-GPU Nodes

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Energy-efficient GPU design with reconfigurable in-package graphics memory

Proceedings of the 2012 ACM/IEEE international symposium on Low power electronics and design
A systematic process for efficient execution on Intel's heterogeneous computation nodes

Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment: Bridging from the eXtreme to the campus and beyond
Simultaneous branch and warp interweaving for sustained GPU performance

Proceedings of the 39th Annual International Symposium on Computer Architecture
CAPRI: prediction of compaction-adequacy for handling control-divergence in GPGPU architectures

Proceedings of the 39th Annual International Symposium on Computer Architecture
Performance and productivity of new programming languages

Facing the Multicore-Challenge II
Parakeet: a just-in-time parallel accelerator for python

HotPar'12 Proceedings of the 4th USENIX conference on Hot Topics in Parallelism
Gdev: first-class GPU resource management in the operating system

USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
SPEC OMP2012 -- an application benchmark suite for parallel systems using OpenMP

IWOMP'12 Proceedings of the 8th international conference on OpenMP in a Heterogeneous World
Feedback-Based global instruction scheduling for GPGPU applications

ICCSA'12 Proceedings of the 12th international conference on Computational Science and Its Applications - Volume Part I
Power-aware multi-core simulation for early design stage hardware/software co-optimization

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Fast and efficient automatic memory management for GPUs using compiler-assisted runtime coherence scheme

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Enhancing performance optimization of multicore chips and multichip nodes with data structure metrics

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
RISE: improving the streaming processors reliability against soft errors in GPGPUs

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Lossless and lossy memory I/O link compression for improving performance of GPGPU workloads

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Workload and power budget partitioning for single-chip heterogeneous processors

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Early evaluation of directive-based GPU programming models for productive exascale computing

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
ValuePack: value-based scheduling framework for CPU-GPU clusters

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Spill code placement for SIMD machines

SBLP'12 Proceedings of the 16th Brazilian conference on Programming Languages
Optimizing bandwidth and power of graphics memory with hybrid memory technologies and adaptive data migration

Proceedings of the International Conference on Computer-Aided Design
OpenMPC: extended OpenMP for efficient programming and tuning on GPUs

International Journal of Computational Science and Engineering
Automatic problem size sensitive task partitioning on heterogeneous parallel systems

Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Data layout optimization for GPGPU architectures

Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Prius: a runtime for hybrid computing

Proceedings of the First International Workshop on Code OptimiSation for MultI and many Cores
Fast dynamic binary rewriting to support thread migration in shared-ISA asymmetric multicores

Proceedings of the First International Workshop on Code OptimiSation for MultI and many Cores
Efficient design space exploration of GPGPU architectures

Euro-Par'12 Proceedings of the 18th international conference on Parallel processing workshops
Inter-warp instruction temporal locality in deep-multithreaded GPUs

ARCS'13 Proceedings of the 26th international conference on Architecture of Computing Systems
GPUDet: a deterministic GPU architecture

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Improving GPGPU concurrency with elastic kernels

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Cache-Conscious Wavefront Scheduling

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Warp size impact in GPUs: large or small?

Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
An automatic input-sensitive approach for heterogeneous task partitioning

Proceedings of the 27th international ACM conference on International conference on supercomputing
Exploiting uniform vector instructions for GPGPU performance, energy efficiency, and opportunistic reliability enhancement

Proceedings of the 27th international ACM conference on International conference on supercomputing
Scaling large-data computations on multi-GPU accelerators

Proceedings of the 27th international ACM conference on International conference on supercomputing
Portable mapping of OpenMP to multicore embedded systems using MCA APIs

Proceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems
Performance characterization of data-intensive kernels on AMD Fusion architectures

Computer Science - Research and Development
GPU acceleration of regular expression matching for large datasets: exploring the implementation space

Proceedings of the ACM International Conference on Computing Frontiers
Load balancing in a changing world: dealing with heterogeneity and performance variability

Proceedings of the ACM International Conference on Computing Frontiers
RFiof: an RF approach to I/O-pin and memory controller scalability for off-chip memories

Proceedings of the ACM International Conference on Computing Frontiers
Cost-effective soft-error protection for SRAM-based structures in GPGPUs

Proceedings of the ACM International Conference on Computing Frontiers
Using synchronization stalls in power-aware accelerators

Proceedings of the Conference on Design, Automation and Test in Europe
Characterizing the performance benefits of fused CPU/GPU systems using FusionSim

Proceedings of the Conference on Design, Automation and Test in Europe
Microarchitectural mechanisms to exploit value structure in SIMT architectures

Proceedings of the 40th Annual International Symposium on Computer Architecture
Exploring memory consistency for massively-threaded throughput-oriented processors

Proceedings of the 40th Annual International Symposium on Computer Architecture
Cooperative boosting: needy versus greedy power management

Proceedings of the 40th Annual International Symposium on Computer Architecture
Orchestrated scheduling and prefetching for GPGPUs

Proceedings of the 40th Annual International Symposium on Computer Architecture
Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation

Proceedings of the 40th Annual International Symposium on Computer Architecture
SIMD divergence optimization through intra-warp compaction

Proceedings of the 40th Annual International Symposium on Computer Architecture
GPUWattch: enabling energy optimizations in GPGPUs

Proceedings of the 40th Annual International Symposium on Computer Architecture
Criticality stacks: identifying critical threads in parallel programs using synchronization behavior

Proceedings of the 40th Annual International Symposium on Computer Architecture
Coordinated energy management in heterogeneous processors

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A data-centric profiler for parallel programs

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Adaptive virtual channel partitioning for network-on-chip in heterogeneous architectures

ACM Transactions on Design Automation of Electronic Systems (TODAES) - Special Section on Networks on Chip: Architecture, Tools, and Methodologies
Barrier invariants: a shared state abstraction for the analysis of data-dependent GPU kernels

Proceedings of the 2013 ACM SIGPLAN international conference on Object oriented programming systems languages & applications
Designing on-chip networks for throughput accelerators

ACM Transactions on Architecture and Code Optimization (TACO)
Memory performance estimation of CUDA programs

ACM Transactions on Embedded Computing Systems (TECS) - Special issue on application-specific processors
Exploring hybrid memory for GPU energy efficiency through software-hardware co-design

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Neither more nor less: optimizing thread-level parallelism for GPGPUs

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
DANBI: dynamic scheduling of irregular stream programs for many-core systems

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Starchart: hardware and software optimization using recursive partitioning regression trees

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Divergence analysis

ACM Transactions on Programming Languages and Systems (TOPLAS)
A measurement study of GPU DVFS on energy conservation

Proceedings of the Workshop on Power-Aware Computing and Systems
A sound and complete abstraction for reasoning about parallel prefix sums

Proceedings of the 41st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages
Scheduling concurrent applications on a cluster of CPU-GPU nodes

Future Generation Computer Systems
Design space exploration of on-chip ring interconnection for a CPU-GPU heterogeneous architecture

Journal of Parallel and Distributed Computing
Exploiting GPU peak-power and performance tradeoffs through reduced effective pipeline latency

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
A locality-aware memory hierarchy for energy-efficient GPU architectures

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Divergence-aware warp scheduling

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Warped gates: gating aware scheduling and power gating for GPGPUs

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Heterogeneous system coherence for integrated CPU-GPU systems

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Optimizing GPU energy efficiency with 3D die-stacking graphics memory and reconfigurable memory interface

ACM Transactions on Architecture and Code Optimization (TACO)
Architectural support for address translation on GPUs: designing memory management units for CPU/GPUs with unified address spaces

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Paraprox: pattern-based approximation for data parallel applications

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Rhythm: harnessing data parallel hardware for server workloads

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Heterogeneous-race-free memory models

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Portable and Transparent Host-Device Communication Optimization for GPGPU Environments

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
CUDA-NP: realizing nested thread-level parallelism in GPGPU applications

Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
Selecting representative benchmark inputs for exploring microprocessor design spaces

ACM Transactions on Architecture and Code Optimization (TACO)
Optimization power consumption model of reliability-aware GPU clusters

The Journal of Supercomputing
An efficient compiler framework for cache bypassing on GPUs

Proceedings of the International Conference on Computer-Aided Design
Dynamic load balancing on heterogeneous multi-GPU systems

Computers and Electrical Engineering
An application-centric evaluation of OpenCL on multi-core CPUs

Parallel Computing
HARP: Harnessing inactive threads in many-core processors

ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
KMA: A Dynamic Memory Manager for OpenCL

Proceedings of Workshop on General Purpose Processing Using GPUs
Efficient Instrumentation of GPGPU Applications Using Information Flow Analysis and Symbolic Execution

Proceedings of Workshop on General Purpose Processing Using GPUs
Power Modeling for Heterogeneous Processors

Proceedings of Workshop on General Purpose Processing Using GPUs
A Walking Dwarf on the Clouds

UCC '13 Proceedings of the 2013 IEEE/ACM 6th International Conference on Utility and Cloud Computing
Boosting CUDA Applications with CPU-GPU Hybrid Computing

International Journal of Parallel Programming

Abstract

This paper presents and characterizes Rodinia, a benchmark suite for heterogeneous computing. To help architects study emerging platforms such as GPUs (Graphics Processing Units), Rodinia includes applications and kernels that target multi-core CPU and GPU platforms. The choice of applications is inspired by Berkeley's dwarf taxonomy. Our characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques, and power-consumption behavior. It has also led to some important architectural insights, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.