The move to multicore processors creates new demands on software development in order to profit from the improved capabilities. Most importantly, algorithms and code must be parallelized wherever possible, but the growing memory wall must also be considered. Additionally, high computational performance can only be reached if architecture-specific features are exploited. To address this complexity, we developed a C++ framework that simplifies the development of performance-optimized, parallel, memory-efficient, stencil-based codes on standard multicore processors and on the heterogeneous Cell processor developed jointly by Sony, Toshiba, and IBM. We illustrate the implementation and optimization of the Fast Wavelet Transform and its inverse for Haar wavelets within our hybrid framework, using OpenMP and the Open Computing Language (OpenCL), and analyze performance results for different platforms.
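As a concrete illustration of the algorithmic core, the sketch below shows one decomposition level of the 1-D Haar Fast Wavelet Transform parallelized with OpenMP. It is a minimal example written under our own assumptions, not the framework's actual API; the function and variable names (haar_forward_level and the like) are hypothetical.

#include <cmath>
#include <cstddef>
#include <vector>

// One level of the 1-D Haar FWT: an input of even length n is mapped to
// n/2 approximation (low-pass) coefficients followed by n/2 detail
// (high-pass) coefficients. Names are illustrative, not part of the framework.
std::vector<double> haar_forward_level(const std::vector<double>& in)
{
    const long long half = static_cast<long long>(in.size() / 2);
    const double inv_sqrt2 = 1.0 / std::sqrt(2.0);
    std::vector<double> out(in.size());

    // Each output pair depends on exactly one input pair, so the loop
    // parallelizes trivially across threads.
    #pragma omp parallel for
    for (long long i = 0; i < half; ++i) {
        const double a = in[2 * i];
        const double b = in[2 * i + 1];
        out[i]        = (a + b) * inv_sqrt2; // approximation coefficient
        out[half + i] = (a - b) * inv_sqrt2; // detail coefficient
    }
    return out;
}

The inverse step recombines each pair, in[2*i] = (out[i] + out[half + i]) * inv_sqrt2 and in[2*i + 1] = (out[i] - out[half + i]) * inv_sqrt2, and is equally data-parallel, which is what makes the transform a good fit for OpenMP and OpenCL back ends alike.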