Much of the current focus in high-performance computing is on multi-threading, multi-computing, and graphics processing unit (GPU) computing. However, vectorization and non-parallel optimization techniques, which can often be applied in addition, are less frequently discussed. In this paper, we analyze several optimizations of both central processing unit (CPU) and GPU implementations of a computationally intensive Metropolis Monte Carlo algorithm. Explicit vectorization on the CPU, and its GPU counterpart, explicit memory coalescing, prove critical to achieving good performance of this algorithm in both environments. The fully-optimized CPU version achieves a 9x to 12x speedup over the original CPU version, in addition to the speedup from multi-threading. It also runs 2x faster than the fully-optimized GPU version, which underscores the importance of optimizing CPU implementations.