Much of the current focus in high-performance computing is on multi-threading, multi-computing, and graphics processing unit (GPU) computing. However, vectorization and non-parallel optimization techniques, which can often be applied in addition, are less frequently discussed. In this paper, we analyze several optimizations of both central processing unit (CPU) and GPU implementations of a computationally intensive Metropolis Monte Carlo algorithm. Explicit vectorization on the CPU, and its GPU counterpart, explicit memory coalescing, prove critical to achieving good performance of this algorithm in both environments. The fully-optimized CPU version achieves a 9x to 12x speedup over the original CPU version, in addition to the speedup from multi-threading. It also runs 2x faster than the fully-optimized GPU version, which underscores the importance of optimizing CPU implementations.