Roofline: an insightful visual performance model for multicore architectures

Authors:
Samuel Williams;Andrew Waterman;David Patterson
Affiliations:
Lawrence Berkeley National Laboratory, Berkeley, CA;University of California, Berkeley;University of California, Berkeley
Venue:
Communications of the ACM - A Direct Path to Dependable Software
Year:
2009

Citing 19
Cited 66

Quantitative system performance: computer system analysis using queueing network models

Quantitative system performance: computer system analysis using queueing network models
Analytic Queueing Network Models for Parallel Processing of Task Systems

IEEE Transactions on Computers
Estimating interlock and improving balance for pipelined architectures

Journal of Parallel and Distributed Computing
Evaluating Associativity in CPU Caches

IEEE Transactions on Computers
Analyzing the behavior and performance of parallel programs

Analyzing the behavior and performance of parallel programs
Improving the ratio of memory operations to floating-point operations in loops

ACM Transactions on Programming Languages and Systems (TOPLAS)
The SPLASH-2 programs: characterization and methodological considerations

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Computer architecture: a quantitative approach

Computer architecture: a quantitative approach
Performance optimizations and bounds for sparse matrix-vector multiply

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Latency lags bandwidth

Communications of the ACM - Voting systems
Mapping computational concepts to GPUs

SIGGRAPH '05 ACM SIGGRAPH 2005 Courses
A Hierarchical Approach to Modeling and Improving the Performance of Scientific Applications on the KSR1

ICPP '94 Proceedings of the 1994 International Conference on Parallel Processing - Volume 03
Performance of Synchronized Iterative Processes in Multiprocessor Systems

IEEE Transactions on Software Engineering
Optimization of sparse matrix-vector multiplication on emerging multicore platforms

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
A genetic algorithms approach to modeling the performance of memory-bound computations

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Amdahl's Law in the Multicore Era

Computer
Validity of the single processor approach to achieving large scale computing capabilities

AFIPS '67 (Spring) Proceedings of the April 18-20, 1967, spring joint computer conference
Auto-tuning performance on multicore computers

Auto-tuning performance on multicore computers

Evaluating multi-core platforms for HPC data-intensive kernels

Proceedings of the 6th ACM conference on Computing frontiers
Using many-core hardware to correlate radio astronomy signals

Proceedings of the 23rd international conference on Supercomputing
A view of the parallel computing landscape

Communications of the ACM - A View of Parallel Computing
Optimization of a lattice Boltzmann computation on state-of-the-art multicore platforms

Journal of Parallel and Distributed Computing
Performance tuning and analysis of future vector processors based on the roofline model

Proceedings of the 10th workshop on MEmory performance: DEaling with Applications, systems and architecture
SCAMPI: a scalable CAM-based algorithm for multiple pattern inspection

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Fast tridiagonal solvers on the GPU

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
High-throughput Bayesian network learning using heterogeneous multicore computers

Proceedings of the 24th ACM International Conference on Supercomputing
An integrated GPU power and performance model

Proceedings of the 37th annual international symposium on Computer architecture
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

Proceedings of the 37th annual international symposium on Computer architecture
Finding an upper bound on the increase in execution time due to contention on the memory bus in COTS-based multicore systems

ACM SIGBED Review - Special Issue on the Work-in-Progress (WIP) Session at the 2009 IEEE Real-Time Systems Symposium (RTSS)
WAYPOINT: scaling coherence to thousand-core architectures

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
A case for machine learning to optimize multicore performance

HotPar'09 Proceedings of the First USENIX conference on Hot topics in parallelism
FPGA-Array with Bandwidth-Reduction Mechanism for Scalable and Power-Efficient Numerical Simulations Based on Finite Difference Methods

ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Design principles for end-to-end multicore schedulers

HotPar'10 Proceedings of the 2nd USENIX conference on Hot topics in parallelism
Diagnosis, Tuning, and Redesign for Multicore Performance: A Case Study of the Fast Multipole Method

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Understanding Application Performance via Micro-benchmarks on Three Large Supercomputers: Intrepid, Ranger and Jaguar

International Journal of High Performance Computing Applications
Prototype implementation of array-processor extensible over multiple FPGAs for scalable stencil computation

ACM SIGARCH Computer Architecture News
Verification of printer datapaths using timed automata

ISoLA'10 Proceedings of the 4th international conference on Leveraging applications of formal methods, verification, and validation - Volume Part II
Performance engineering: a must for petascale and beyond

Proceedings of the third international workshop on Large-scale system and application performance
Balance principles for algorithm-architecture co-design

HotPar'11 Proceedings of the 3rd USENIX conference on Hot topic in parallelism
What Hill-Marty model learn from and break through Amdahl's law?

Information Processing Letters
Performance evaluations of gyrokinetic Eulerian code GT5D on massively parallel multi-core platforms

State of the Practice Reports
Performance modeling for systematic performance tuning

State of the Practice Reports
World-highest resolution global atmospheric model and its performance on the Earth Simulator

State of the Practice Reports
Tiled QR factorization algorithms

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
CudaDMA: optimizing GPU memory bandwidth via warp specialization

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Multicore/Multi-GPU Accelerated Simulations of Multiphase Compressible Flows Using Wavelet Adapted Grids

SIAM Journal on Scientific Computing
Domain-specific programmable design of scalable streaming-array for power-efficient stencil computation

ACM SIGARCH Computer Architecture News
GPU and APU computations of Finite Time Lyapunov Exponent fields

Journal of Computational Physics
A performance analysis framework for identifying potential benefits in GPGPU applications

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
The boat hull model: adapting the roofline model to enable performance prediction for parallel computing

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Portable parallel performance from sequential, productive, embedded domain-specific languages

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Model-driven adaptation of double-precision matrix multiplication to the Cell processor architecture

Parallel Computing
The boat hull model: enabling performance prediction for parallel computing prior to code development

Proceedings of the 9th conference on Computing Frontiers
Domain-Specific language and compiler for stencil computation on FPGA-Based systolic computational-memory array

ARC'12 Proceedings of the 8th international conference on Reconfigurable Computing: architectures, tools and applications
A polyphase filter for GPUs and multi-core processors

Proceedings of the 2012 workshop on High-Performance Computing for Astronomy Date
An efficient mixed-precision, hybrid CPU-GPU implementation of a nonlinearly implicit one-dimensional particle-in-cell algorithm

Journal of Computational Physics
Parallelization of EULAG model on multicore architectures with GPU accelerators

PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part II
Power-aware multi-core simulation for early design stage hardware/software co-optimization

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
The impact of heterogeneous multi-core clusters on graph partitioning: an empirical study

Cluster Computing
High throughput software for direct numerical simulations of compressible two-phase flows

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Aspen: a domain specific language for performance modeling

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Algorithmic species: A classification of affine loop nests for parallel programming

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
GPURoofline: a model for guiding performance optimizations on GPUs

Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
An insightful program performance tuning chain for GPU computing

ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I
How much (execution) time and energy does my algorithm cost?

XRDS: Crossroads, The ACM Magazine for Students - Scientific Computing
Performance and toolchain of a combined GPU/FPGA desktop (abstract only)

Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays
Diagnosis and optimization of application prefetching performance

Proceedings of the 27th international ACM conference on International conference on supercomputing
Performance analysis and prediction for distributed homogeneous clusters

Computer Science - Research and Development
Future of GPGPU micro-architectural parameters

Proceedings of the Conference on Design, Automation and Test in Europe
Exploring the Tradeoffs between Programmability and Efficiency in Data-Parallel Accelerators

ACM Transactions on Computer Systems (TOCS)
Modeling and predicting performance of high performance computing applications on hardware accelerators

International Journal of High Performance Computing Applications
Solving the compressible Navier-Stokes equations on up to 1.97 million cores and 4.1 trillion grid points

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A framework for hybrid parallel flow simulations with a trillion cells in complex geometries

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Using automated performance modeling to find scalability bugs in complex codes

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
11 PFLOP/s simulations of cloud cavitation collapse

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Assessing the performance of OpenMP programs on the Intel Xeon Phi

Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
On the GPU performance of cell-centered finite volume method over unstructured tetrahedral meshes

IA^3 '13 Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms
Towards making autotuning mainstream

International Journal of High Performance Computing Applications
Roofline-aware DVFS for GPUs

Proceedings of International Workshop on Adaptive Self-tuning Computing Systems
Optimizing convolution operations on GPUs using adaptive tiling

Future Generation Computer Systems
An application-centric evaluation of OpenCL on multi-core CPUs

Parallel Computing
Performance Evaluation and Optimization Mechanisms for Inter-operable Graphics and Computation on GPUs

Proceedings of Workshop on General Purpose Processing Using GPUs
Performance modeling for FPGAs: extending the roofline model with high-level synthesis tools

International Journal of Reconfigurable Computing

Quantified Score

Hi-index	0.02

Abstract

The Roofline model offers insight into how to improve the performance of software and hardware.
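
As a rough illustration of the kind of insight meant here: the Roofline model bounds a kernel's attainable floating-point performance by the lesser of the machine's peak compute throughput and the product of its peak memory bandwidth and the kernel's operational intensity (FLOPs per byte of DRAM traffic). The sketch below assumes that formulation. The function names and the machine figures are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
def attainable_gflops(peak_gflops, peak_bandwidth_gbs, operational_intensity):
    """Roofline bound in GFLOP/s.

    peak_gflops           -- peak floating-point throughput (GFLOP/s)
    peak_bandwidth_gbs    -- peak memory bandwidth (GB/s)
    operational_intensity -- FLOPs performed per byte of DRAM traffic
    """
    return min(peak_gflops, peak_bandwidth_gbs * operational_intensity)


def ridge_point(peak_gflops, peak_bandwidth_gbs):
    """Operational intensity (FLOPs/byte) at which the two bounds meet."""
    return peak_gflops / peak_bandwidth_gbs


if __name__ == "__main__":
    # Hypothetical machine: 100 GFLOP/s peak compute, 20 GB/s peak bandwidth.
    peak, bandwidth = 100.0, 20.0
    for intensity in (0.5, 2.0, 8.0):
        bound = attainable_gflops(peak, bandwidth, intensity)
        limit = "memory" if intensity < ridge_point(peak, bandwidth) else "compute"
        print(f"intensity {intensity:4.1f} FLOPs/byte: "
              f"at most {bound:6.1f} GFLOP/s ({limit}-bound)")
```

Under these assumptions, a kernel whose operational intensity lies below the ridge point is limited by memory bandwidth, so reducing memory traffic is the more promising optimization. A kernel above the ridge point is limited by compute throughput.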