Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Toward techniques for auto-tuning GPU algorithms
PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume 2
Hi-index | 0.00 |
The field of high-performance computing is highly dependent on increasingly complex computer architectures. Parallel computing has been the norm for decades, but hardware architectures like the Cell Broadband Engine (Cell/BE) and General Purpose GPUs (GPGPUs) introduce additional complexities and are difficult to program efficiently even for well-suited problems. Efficiency is taken to include both maximizing the performance of the software and minimizing the programming effort required. With the goal of exposing the challenges facing a domain scientist using these types of hardware, in this paper we discuss the implementation of a Monte Carlo simulation of a system of charged particles on the Cell/BE and for GPUs. We focus on Coulomb interactions because their long-range nature prohibits using cut-offs to reduce the number of calculations, making simulations very expensive. The goal was to encapsulate the computationally expensive component of the program in a way so as to be useful to domain researchers with legacy codes. Generality and flexibility were therefore just as important as performance. Using the GPU and Cell/BE library requires only small changes in the simulation codes we've seen and yields programs that run at or near the theoretical peak performance of the hardware.